Category archive: BioInformatics

Deep Learning System Improves Breast Cancer Detection

Researchers from Beth Israel Deaconess Medical Center (BIDMC) and Harvard Medical School have developed a deep learning approach to read and interpret pathology images.

Trained on Tesla K80 GPUs with the cuDNN-accelerated Caffe deep learning framework, their system achieved 92 percent accuracy at identifying breast cancer in images of lymph nodes, which earned them the top prize in two separate categories at the annual International Symposium on Biomedical Imaging (ISBI) challenge. The team also published a paper detailing more of their work.

For the slide-based classification task, human pathologists were accurate 96 percent of the time.

The framework used for breast cancer detection.

Andrew Beck from BIDMC said what’s truly exciting is that 99.5 percent accuracy can be achieved when the pathologists’ analysis and results from the deep learning system are used together. He added, “Our results in the ISBI competition show that what the computer is doing is genuinely intelligent and that the combination of human and computer interpretations will result in more precise and more clinically valuable diagnoses to guide treatment decisions.”
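
The accuracy figures quoted above are easier to compare when restated as error rates. The sketch below is plain arithmetic on the numbers in this article (96 and 99.5 percent); it is not a reproduction of the team's evaluation, and the function name is made up for illustration.

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Fraction of the baseline's errors that the improved system removes."""
    baseline_error = 1.0 - baseline_accuracy
    improved_error = 1.0 - improved_accuracy
    return (baseline_error - improved_error) / baseline_error


pathologist_alone = 0.96  # accuracy quoted in the article
combined = 0.995          # pathologist plus deep learning system, as quoted

reduction = relative_error_reduction(pathologist_alone, combined)
print(f"Error rate: {1 - pathologist_alone:.1%} -> {1 - combined:.1%}")
print(f"Relative reduction in errors: {reduction:.1%}")
```

On the quoted figures, the error rate falls from 4 percent to 0.5 percent, which is a relative reduction of about 87.5 percent.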

Awesome Machine Learning

Table of Contents

APL

General-Purpose Machine Learning

  • naive-apl – Naive Bayesian Classifier implementation in APL

C

General-Purpose Machine Learning

  • Recommender – A C library for product recommendations/suggestions using collaborative filtering (CF).
  • Darknet – Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

Computer Vision

  • CCV – C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library
  • VLFeat – VLFeat is an open and portable library of computer vision algorithms, with a MATLAB toolbox

Speech Recognition

  • HTK – The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models.

C++

Computer Vision

  • OpenCV – OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS.
  • DLib – DLib has C++ and Python interfaces for face detection and training general object detectors.
  • EBLearn – Eblearn is an object-oriented C++ library that implements various machine learning models
  • VIGRA – VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of arbitrary dimensionality with Python bindings.

General-Purpose Machine Learning

  • mlpack – A scalable C++ machine learning library
  • DLib – A suite of ML tools designed to be easy to embed in other applications
  • encog-cpp
  • shark
  • Vowpal Wabbit (VW) – A fast out-of-core learning system.
  • sofia-ml – Suite of fast incremental algorithms.
  • Shogun – The Shogun Machine Learning Toolbox
  • Caffe – A deep learning framework developed with cleanliness, readability, and speed in mind. [DEEP LEARNING]
  • CXXNET – Yet another deep learning framework with fewer than 1000 lines of core code [DEEP LEARNING]
  • XGBoost – A parallelized optimized general purpose gradient boosting library.
  • CUDA – This is a fast C++/CUDA implementation of convolutional neural networks [DEEP LEARNING]
  • Stan – A probabilistic programming language implementing full Bayesian statistical inference with Hamiltonian Monte Carlo sampling
  • BanditLib – A simple Multi-armed Bandit library.
  • Timbl – A software package/C++ library implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification, and IGTree, a decision-tree approximation of IB1-IG. Commonly used for NLP.
  • Distributed Machine learning Tool Kit (DMTK) – A distributed machine learning (parameter server) framework by Microsoft. Enables training models on large data sets across multiple machines. Current tools bundled with it include: LightLDA and Distributed (Multisense) Word Embedding.
  • igraph – General purpose graph library
  • Warp-CTC – A fast parallel implementation of Connectionist Temporal Classification (CTC), on both CPU and GPU.
  • CNTK – The Computational Network Toolkit (CNTK) by Microsoft Research, is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph.
  • DeepDetect – A machine learning API and server written in C++11. It makes state of the art machine learning easy to work with and integrate into existing applications.
  • Fido – A highly-modular C++ machine learning library for embedded electronics and robotics.
  • DSSTNE – A software library created by Amazon for training and deploying deep neural networks using GPUs which emphasizes speed and scale over experimental flexibility.
  • Intel(R) DAAL – A high performance software library developed by Intel and optimized for Intel’s architectures. The library provides algorithmic building blocks for all stages of data analytics and can process data in batch, online and distributed modes.

Natural Language Processing

  • MIT Information Extraction Toolkit – C, C++, and Python tools for named entity recognition and relation extraction
  • CRF++ – Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
  • CRFsuite – CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data.
  • BLLIP Parser – BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
  • colibri-core – C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
  • ucto – Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
  • libfolia – C++ library for the FoLiA format
  • frog – Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.
  • MeTA – MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data.

Speech Recognition

  • Kaldi – Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

Sequence Analysis

  • ToPS – This is an object-oriented framework that facilitates the integration of probabilistic models for sequences over a user-defined alphabet.

Gesture Detection

  • grt – The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source, C++ machine learning library designed for real-time gesture recognition.

Common Lisp

General-Purpose Machine Learning

  • mgl – Neural networks (boltzmann machines, feed-forward and recurrent nets), Gaussian Processes
  • mgl-gpr – Evolutionary algorithms
  • cl-libsvm – Wrapper for the libsvm support vector machine library

Clojure

Natural Language Processing

  • Clojure-openNLP – Natural Language Processing in Clojure (opennlp)
  • Inflections-clj – Rails-like inflection library for Clojure and ClojureScript

General-Purpose Machine Learning

  • Touchstone – Clojure A/B testing library
  • Clojush – The Push programming language and the PushGP genetic programming system implemented in Clojure
  • Infer – Inference and machine learning in clojure
  • Clj-ML – A machine learning library for Clojure built on top of Weka and friends
  • Encog – Clojure wrapper for Encog (v3) (Machine-Learning framework that specializes in neural-nets)
  • Fungp – A genetic programming library for Clojure
  • Statistiker – Basic Machine Learning algorithms in Clojure.
  • clortex – General Machine Learning library using Numenta’s Cortical Learning Algorithm
  • comportex – Functionally composable Machine Learning library using Numenta’s Cortical Learning Algorithm

Data Analysis / Data Visualization

  • Incanter – Incanter is a Clojure-based, R-like platform for statistical computing and graphics.
  • PigPen – Map-Reduce for Clojure.
  • Envision – Clojure Data Visualisation library, based on Statistiker and D3

Elixir

General-Purpose Machine Learning

  • Simple Bayes – A Simple Bayes / Naive Bayes implementation in Elixir.

Natural Language Processing

  • Stemmer – An English (Porter2) stemming implementation in Elixir.

Erlang

General-Purpose Machine Learning

  • Disco – Map Reduce in Erlang

Go

Natural Language Processing

  • go-porterstemmer – A native Go clean room implementation of the Porter Stemming algorithm.
  • paicehusk – Golang implementation of the Paice/Husk Stemming Algorithm.
  • snowball – Snowball Stemmer for Go.
  • go-ngram – In-memory n-gram index with compression.

General-Purpose Machine Learning

  • gago – Multi-population, flexible, parallel genetic algorithm.
  • Go Learn – Machine Learning for Go
  • go-pr – Pattern recognition package in Go lang.
  • go-ml – Linear / Logistic regression, Neural Networks, Collaborative Filtering and Gaussian Multivariate Distribution
  • bayesian – Naive Bayesian Classification for Golang.
  • go-galib – Genetic Algorithms library written in Go / golang
  • Cloudforest – Ensembles of decision trees in go/golang.
  • gobrain – Neural Networks written in go
  • GoNN – GoNN is an implementation of Neural Network in Go Language, which includes BPNN, RBF, PCN
  • MXNet – Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more.

Data Analysis / Data Visualization

  • go-graph – Graph library for Go/golang language.
  • SVGo – The Go Language library for SVG generation
  • RF – Random forests implementation in Go

Haskell

General-Purpose Machine Learning

  • haskell-ml – Haskell implementations of various ML algorithms.
  • HLearn – a suite of libraries for interpreting machine learning models according to their algebraic structure.
  • hnn – Haskell Neural Network library.
  • hopfield-networks – Hopfield Networks for unsupervised learning in Haskell.
  • caffegraph – A DSL for deep neural networks
  • LambdaNet – Configurable Neural Networks in Haskell

Java

Natural Language Processing

  • Cortical.io – Retina: an API performing complex NLP operations (disambiguation, classification, streaming text filtering, etc…) as quickly and intuitively as the brain.
  • CoreNLP – Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words
  • Stanford Parser – A natural language parser is a program that works out the grammatical structure of sentences
  • Stanford POS Tagger – A Part-Of-Speech Tagger (POS Tagger)
  • Stanford Name Entity Recognizer – Stanford NER is a Java implementation of a Named Entity Recognizer.
  • Stanford Word Segmenter – Tokenization of raw text is a standard pre-processing step for many NLP tasks.
  • Tregex, Tsurgeon and Semgrex – Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for “tree regular expressions”).
  • Stanford Phrasal: A Phrase-Based Translation System – Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java.
  • Stanford English Tokenizer – A tokenizer divides text into a sequence of tokens, which roughly correspond to “words”
  • Stanford Tokens Regex – A framework for defining regular expressions over sequences of tokens.
  • Stanford Temporal Tagger – SUTime is a library for recognizing and normalizing time expressions.
  • Stanford SPIED – Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion
  • Stanford Topic Modeling Toolbox – Topic modeling tools to social scientists and others who wish to perform analysis on datasets
  • Twitter Text Java – A Java implementation of Twitter’s text processing library
  • MALLET – A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  • OpenNLP – a machine learning based toolkit for the processing of natural language text.
  • LingPipe – A tool kit for processing text using computational linguistics.
  • ClearTK – ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA.
  • Apache cTAKES – Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text.
  • ClearNLP – The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license.
  • CogcompNLP – This project collects a number of core libraries for Natural Language Processing (NLP) developed in the University of Illinois’ Cognitive Computation Group, for example illinois-core-utilities which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc, illinois-edison a library for feature extraction from illinois-core-utilities data structures and many other packages.

General-Purpose Machine Learning

  • aerosolve – A machine learning library by Airbnb designed from the ground up to be human friendly.
  • Datumbox – Machine Learning framework for rapid development of Machine Learning and Statistical applications
  • ELKI – Java toolkit for data mining. (unsupervised: clustering, outlier detection etc.)
  • Encog – An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.
  • FlinkML in Apache Flink – Distributed machine learning library in Flink
  • H2O – ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON.
  • htm.java – General Machine Learning library using Numenta’s Cortical Learning Algorithm
  • java-deeplearning – Distributed Deep Learning Platform for Java, Clojure, Scala
  • Mahout – Distributed machine learning
  • Meka – An open source implementation of methods for multi-label classification and evaluation (extension to Weka).
  • MLlib in Apache Spark – Distributed machine learning library in Spark
  • Neuroph – Neuroph is a lightweight Java neural network framework
  • ORYX – Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning.
  • Samoa – SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug in different stream processing platforms.
  • RankLib – RankLib is a library of learning to rank algorithms
  • rapaio – statistics, data mining and machine learning toolbox in Java
  • RapidMiner – RapidMiner integration into Java code
  • Stanford Classifier – A classifier is a machine learning tool that will take data items and place them into one of k classes.
  • SmileMiner – Statistical Machine Intelligence & Learning Engine
  • SystemML – flexible, scalable machine learning (ML) language.
  • WalnutiQ – object oriented model of the human brain
  • Weka – Weka is a collection of machine learning algorithms for data mining tasks
  • LBJava – Learning Based Java is a modeling language for the rapid development of software systems. It offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer’s application.

Speech Recognition

  • CMU Sphinx – Open source toolkit for speech recognition, based purely on a Java speech recognition library.

Data Analysis / Data Visualization

  • Flink – Open source platform for distributed stream and batch data processing.
  • Hadoop – Hadoop/HDFS
  • Spark – Spark is a fast and general engine for large-scale data processing.
  • Storm – Storm is a distributed realtime computation system.
  • Impala – Real-time Query for Hadoop
  • DataMelt – Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization.
  • Dr. Michael Thomas Flanagan’s Java Scientific Library

Deep Learning

  • Deeplearning4j – Scalable deep learning for industry with parallel GPUs

Javascript

Natural Language Processing

  • Twitter-text – A JavaScript implementation of Twitter’s text processing library
  • NLP.js – NLP utilities in javascript and coffeescript
  • natural – General natural language facilities for node
  • Knwl.js – A Natural Language Processor in JS
  • Retext – Extensible system for analyzing and manipulating natural language
  • TextProcessing – Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition.
  • NLP Compromise – Natural Language processing in the browser

Data Analysis / Data Visualization

  • D3.js
  • High Charts
  • NVD3.js
  • dc.js
  • chartjs
  • dimple
  • amCharts
  • D3xter – Straightforward plotting built on D3
  • statkit – Statistics kit for JavaScript
  • datakit – A lightweight framework for data analysis in JavaScript
  • science.js – Scientific and statistical computing in JavaScript.
  • Z3d – Easily make interactive 3d plots built on Three.js
  • Sigma.js – JavaScript library dedicated to graph drawing.
  • C3.js – Customizable library based on D3.js for easy chart drawing.
  • Datamaps – Customizable SVG map/geo visualizations using D3.js.
  • ZingChart – Library written in vanilla JS for big data visualization.
  • cheminfo – Platform for data visualization and analysis, using the visualizer project.

General-Purpose Machine Learning

  • Convnet.js – ConvNetJS is a Javascript library for training Deep Learning models [DEEP LEARNING]
  • Clusterfck – Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser
  • Clustering.js – Clustering algorithms implemented in Javascript for Node.js and the browser
  • Decision Trees – NodeJS Implementation of Decision Tree using ID3 Algorithm
  • DN2A – Digital Neural Networks Architecture
  • figue – K-means, fuzzy c-means and agglomerative clustering
  • Node-fann – FANN (Fast Artificial Neural Network Library) bindings for Node.js
  • Kmeans.js – Simple Javascript implementation of the k-means algorithm, for node.js and the browser
  • LDA.js – LDA topic modeling for node.js
  • Learning.js – Javascript implementation of logistic regression/c4.5 decision tree
  • Machine Learning – Machine learning library for Node.js
  • mil-tokyo – List of several machine learning libraries
  • Node-SVM – Support Vector Machine for nodejs
  • Brain – Neural networks in JavaScript [Deprecated]
  • Bayesian-Bandit – Bayesian bandit implementation for Node and the browser.
  • Synaptic – Architecture-free neural network library for node.js and the browser
  • kNear – JavaScript implementation of the k nearest neighbors algorithm for supervised learning
  • NeuralN – C++ Neural Network library for Node.js. It has an advantage on large datasets and supports multi-threaded training.
  • kalman – Kalman filter for Javascript.
  • shaman – node.js library with support for both simple and multiple linear regression.
  • ml.js – Machine learning and numerical analysis tools for Node.js and the Browser!
  • Pavlov.js – Reinforcement learning using Markov Decision Processes
  • MXNet – Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more.

Misc

  • sylvester – Vector and Matrix math for JavaScript.
  • simple-statistics – A JavaScript implementation of descriptive, regression, and inference statistics. Implemented in literate JavaScript with no dependencies, designed to work in all modern browsers (including IE) as well as in node.js.
  • regression-js – A javascript library containing a collection of least squares fitting methods for finding a trend in a set of data.
  • Lyric – Linear Regression library.
  • GreatCircle – Library for calculating great circle distance.
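
As a rough illustration of what a great circle distance calculation involves, here is a minimal sketch using the haversine formula. It is written in Python rather than JavaScript, it is not the API of the GreatCircle library listed above, and the function name and Earth radius constant are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in kilometres


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
print(round(great_circle_km(0.0, 0.0, 0.0, 90.0), 1))
```

The haversine form treats the Earth as a sphere, so it is an approximation; libraries that need higher accuracy typically use an ellipsoidal model instead.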

Julia

General-Purpose Machine Learning

  • MachineLearning – Julia Machine Learning library
  • MLBase – A set of functions to support the development of machine learning algorithms
  • PGM – A Julia framework for probabilistic graphical models.
  • DA – Julia package for Regularized Discriminant Analysis
  • Regression – Algorithms for regression analysis (e.g. linear regression and logistic regression)
  • Local Regression – Local regression, so smooooth!
  • Naive Bayes – Simple Naive Bayes implementation in Julia
  • Mixed Models – A Julia package for fitting (statistical) mixed-effects models
  • Simple MCMC – basic mcmc sampler implemented in Julia
  • Distance – Julia module for Distance evaluation
  • Decision Tree – Decision Tree Classifier and Regressor
  • Neural – A neural network in Julia
  • MCMC – MCMC tools for Julia
  • Mamba – Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia
  • GLM – Generalized linear models in Julia
  • Online Learning
  • GLMNet – Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet
  • Clustering – Basic functions for clustering data: k-means, dp-means, etc.
  • SVM – SVM’s for Julia
  • Kernel Density – Kernel density estimators for Julia
  • Dimensionality Reduction – Methods for dimensionality reduction
  • NMF – A Julia package for non-negative matrix factorization
  • ANN – Julia artificial neural networks
  • Mocha – Deep Learning framework for Julia inspired by Caffe
  • XGBoost – eXtreme Gradient Boosting Package in Julia
  • ManifoldLearning – A Julia package for manifold learning and nonlinear dimensionality reduction
  • MXNet – Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more.
  • Merlin – Flexible Deep Learning Framework in Julia
  • ROCAnalysis – Receiver Operating Characteristics and functions for evaluating probabilistic binary classifiers
  • GaussianMixtures – Large scale Gaussian Mixture Models
  • ScikitLearn – Julia implementation of the scikit-learn API

Natural Language Processing

Data Analysis / Data Visualization

  • Graph Layout – Graph layout algorithms in pure Julia
  • Data Frames Meta – Metaprogramming tools for DataFrames
  • Julia Data – library for working with tabular data in Julia
  • Data Read – Read files from Stata, SAS, and SPSS
  • Hypothesis Tests – Hypothesis tests for Julia
  • Gadfly – Crafty statistical graphics for Julia.
  • Stats – Statistical tests for Julia
  • RDataSets – Julia package for loading many of the data sets available in R
  • DataFrames – library for working with tabular data in Julia
  • Distributions – A Julia package for probability distributions and associated functions.
  • Data Arrays – Data structures that allow missing values
  • Time Series – Time series toolkit for Julia
  • Sampling – Basic sampling algorithms for Julia

Misc Stuff / Presentations

  • DSP – Digital Signal Processing (filtering, periodograms, spectrograms, window functions).
  • JuliaCon Presentations – Presentations for JuliaCon
  • SignalProcessing – Signal Processing tools for Julia
  • Images – An image library for Julia

Lua

General-Purpose Machine Learning

  • Torch7
    • cephes – Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used, among many other places, at the heart of SciPy.
    • autograd – Autograd automatically differentiates native Torch code. Inspired by the original Python version.
    • graph – Graph package for Torch
    • randomkit – Numpy’s randomkit, wrapped for Torch
    • signal – A signal processing toolbox for Torch-7. FFT, DCT, Hilbert, cepstrums, stft
    • nn – Neural Network package for Torch
    • torchnet – framework for torch which provides a set of abstractions aiming at encouraging code re-use as well as encouraging modular programming
    • nngraph – This package provides graphical computation for nn library in Torch7.
    • nnx – A completely unstable and experimental package that extends Torch’s builtin nn library
    • rnn – A Recurrent Neural Network library that extends Torch’s nn. RNNs, LSTMs, GRUs, BRNNs, BLSTMs, etc.
    • dpnn – Many useful features that aren’t part of the main nn package.
    • dp – A deep learning library designed for streamlining research and development using the Torch7 distribution. It emphasizes flexibility through the elegant use of object-oriented design patterns.
    • optim – An optimization library for Torch. SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more.
    • unsup – A package for unsupervised learning in Torch. Provides modules that are compatible with nn (LinearPsd, ConvPsd, AutoEncoder, …), and self-contained algorithms (k-means, PCA).
    • manifold – A package to manipulate manifolds
    • svm – Torch-SVM library
    • lbfgs – FFI Wrapper for liblbfgs
    • vowpalwabbit – An old vowpalwabbit interface to torch.
    • OpenGM – OpenGM is a C++ library for graphical modeling, and inference. The Lua bindings provide a simple way of describing graphs, from Lua, and then optimizing them with OpenGM.
    • spaghetti – Spaghetti (sparse linear) module for torch7 by @MichaelMathieu
    • LuaSHKit – A lua wrapper around the Locality sensitive hashing library SHKit
    • kernel smoothing – KNN, kernel-weighted average, local linear regression smoothers
    • cutorch – Torch CUDA Implementation
    • cunn – Torch CUDA Neural Network Implementation
    • imgraph – An image/graph library for Torch. This package provides routines to construct graphs on images, segment them, build trees out of them, and convert them back to images.
    • videograph – A video/graph library for Torch. This package provides routines to construct graphs on videos, segment them, build trees out of them, and convert them back to videos.
    • saliency – code and tools around integral images. A library for finding interest points based on fast integral histograms.
    • stitch – allows us to use hugin to stitch images and apply same stitching to a video sequence
    • sfm – A bundle adjustment/structure from motion package
    • fex – A package for feature extraction in Torch. Provides SIFT and dSIFT modules.
    • OverFeat – A state-of-the-art generic dense feature extractor
  • Numeric Lua
  • Lunatic Python
  • SciLua
  • Lua – Numerical Algorithms
  • Lunum

Demos and Scripts

  • Core torch7 demos repository.
    • linear-regression, logistic-regression
    • face detector (training and detection as separate demos)
    • mst-based-segmenter
    • train-a-digit-classifier
    • train-autoencoder
    • optical flow demo
    • train-on-housenumbers
    • train-on-cifar
    • tracking with deep nets
    • kinect demo
    • filter-bank visualization
    • saliency-networks
  • Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)
  • Music Tagging – Music Tagging scripts for torch7
  • torch-datasets – Scripts to load several popular datasets including:
    • BSR 500
    • CIFAR-10
    • COIL
    • Street View House Numbers
    • MNIST
    • NORB
  • Atari2600 – Scripts to generate a dataset with static frames from the Arcade Learning Environment

Matlab

Computer Vision

  • Contourlets – MATLAB source code that implements the contourlet transform and its utility functions.
  • Shearlets – MATLAB code for shearlet transform
  • Curvelets – The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles.
  • Bandlets – MATLAB code for bandlet transform
  • mexopencv – Collection and a development kit of MATLAB mex functions for OpenCV library

Natural Language Processing

  • NLP – An NLP library for Matlab

General-Purpose Machine Learning

Data Analysis / Data Visualization

  • matlab_bgl – MatlabBGL is a Matlab package for working with graphs.
  • gaimc – Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL’s mex functions.

.NET

Computer Vision

  • OpenCVDotNet – A wrapper for the OpenCV project to be used with .NET applications.
  • Emgu CV – Cross platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android.
  • AForge.NET – Open source C# framework for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.
  • Accord.NET – Together with AForge.NET, this library can provide image processing and computer vision algorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and Android.

Natural Language Processing

  • Stanford.NLP for .NET – A full port of Stanford NLP packages to .NET and also available precompiled as a NuGet package.

General-Purpose Machine Learning

  • Accord-Framework – The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications.
  • Accord.MachineLearning – Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaussian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-learning applications. This package is part of the Accord.NET Framework.
  • DiffSharp – An automatic differentiation (AD) library providing exact and efficient derivatives (gradients, Hessians, Jacobians, directional derivatives, and matrix-free Hessian- and Jacobian-vector products) for machine learning and optimization applications. Operations can be nested to any level, meaning that you can compute exact higher-order derivatives and differentiate functions that are internally making use of differentiation, for applications such as hyperparameter optimization.
  • Vulpes – Deep belief and deep learning implementation written in F# and leverages CUDA GPU execution with Alea.cuBase.
  • Encog – An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.
  • Neural Network Designer – DBMS management system and designer for neural networks. The designer application is developed using WPF, and is a user interface which allows you to design your neural network, query the network, and create and configure chat bots that are capable of asking questions and learning from your feedback. The chat bots can even scrape the internet for information to return in their output as well as to use for learning.

Data Analysis / Data Visualization

  • numl – numl is a machine learning library intended to ease the use of standard modeling techniques for both prediction and clustering.
  • Math.NET Numerics – Numerical foundation of the Math.NET project, aiming to provide methods and algorithms for numerical computations in science, engineering and every day use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.
  • Sho – Sho is an interactive environment for data analysis and scientific computing that lets you seamlessly connect scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich interactive shell for rapid development.

Objective C

General-Purpose Machine Learning

  • YCML – A Machine Learning framework for Objective-C and Swift (OS X / iOS).
  • MLPNeuralNet – Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples using a trained neural network. It is built on top of Apple’s Accelerate Framework, using vectorized operations and hardware acceleration if available.
  • MAChineLearning – An Objective-C multilayer perceptron library, with full support for training through backpropagation. Implemented using vDSP and vecLib, it’s 20 times faster than its Java equivalent. Includes sample code for use from Swift.
  • BPN-NeuralNetwork – Implements a 3-layer neural network (input layer, hidden layer and output layer) known as a Back Propagation Neural Network (BPN). It can be used for product recommendation, user behavior analysis, data mining and data analysis.
  • Multi-Perceptron-NeuralNetwork – Implements a multi-layer perceptron neural network based on the Back Propagation Neural Network (BPN), designed to support an unlimited number of hidden layers.
  • KRHebbian-Algorithm – An unsupervised, self-learning algorithm (it adjusts the weights) for neural networks in machine learning.
  • KRKmeans-Algorithm – Implements K-Means, the clustering and classification algorithm. It can be used in data mining and image compression.
  • KRFuzzyCMeans-Algorithm – Implements Fuzzy C-Means (FCM), the fuzzy clustering / classification algorithm in machine learning. It can be used in data mining and image compression.

OCaml

General-Purpose Machine Learning

  • Oml – A general statistics and machine learning library.
  • GPR – Efficient Gaussian Process Regression in OCaml.
  • Libra-Tk – Algorithms for learning and inference with discrete probabilistic models.

PHP

Natural Language Processing

  • jieba-php – Chinese Words Segmentation Utilities.

General-Purpose Machine Learning

  • PredictionBuilder – A library for machine learning that builds predictions using a linear regression.

Python

Computer Vision

  • Scikit-Image – A collection of algorithms for image processing in Python.
  • SimpleCV – An open source computer vision framework that gives access to several high-powered computer vision libraries, such as OpenCV. Written in Python and runs on Mac, Windows, and Ubuntu Linux.
  • Vigranumpy – Python bindings for the VIGRA C++ computer vision library.
  • OpenFace – Free and open source face recognition with deep neural networks.
  • PCV – Open source Python module for computer vision

Natural Language Processing

  • NLTK – A leading platform for building Python programs to work with human language data.
  • Pattern – A web mining module for the Python programming language. It has tools for natural language processing, machine learning, among others.
  • Quepy – A python framework to transform natural language questions to queries in a database query language
  • TextBlob – Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
  • YAlign – A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora.
  • jieba – Chinese Words Segmentation Utilities.
  • SnowNLP – A library for processing Chinese text.
  • spammy – A library for email Spam filtering built on top of nltk
  • loso – Another Chinese segmentation library.
  • genius – A Chinese segmentation library based on Conditional Random Fields.
  • KoNLPy – A Python package for Korean natural language processing.
  • nut – Natural language Understanding Toolkit
  • Rosetta – Text processing tools and wrappers (e.g. Vowpal Wabbit)
  • BLLIP Parser – Python bindings for the BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
  • PyNLPl – Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments.
  • python-ucto – Python binding to ucto (a unicode-aware rule-based tokenizer for various languages)
  • python-frog – Python binding to Frog, an NLP suite for Dutch. (pos tagging, lemmatisation, dependency parsing, NER)
  • python-zpar – Python bindings for ZPar, a statistical part-of-speech-tagger, constituency parser, and dependency parser for English.
  • colibri-core – Python binding to C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
  • spaCy – Industrial strength NLP with Python and Cython.
  • PyStanfordDependencies – Python interface for converting Penn Treebank trees to Stanford Dependencies.
  • Distance – Levenshtein and Hamming distance computation
  • Fuzzy Wuzzy – Fuzzy String Matching in Python
  • jellyfish – a python library for doing approximate and phonetic matching of strings.
  • editdistance – fast implementation of edit distance
  • textacy – higher-level NLP built on spaCy

General-Purpose Machine Learning

  • machine learning – An automated build consisting of a web interface and a set of programmatic-interface APIs for support vector machines. The corresponding datasets are stored in a SQL database, and the generated models used for prediction are stored in a NoSQL datastore.
  • XGBoost – Python bindings for eXtreme Gradient Boosting (Tree) Library
  • Bayesian Methods for Hackers – Book/iPython notebooks on Probabilistic Programming in Python
  • Featureforge – A set of tools for creating and testing machine learning features, with a scikit-learn compatible API
  • MLlib in Apache Spark – Distributed machine learning library in Spark
  • scikit-learn – A Python module for machine learning built on top of SciPy.
  • metric-learn – A Python module for metric learning.
  • SimpleAI – Python implementation of many of the artificial intelligence algorithms described in the book “Artificial Intelligence, a Modern Approach”. It focuses on providing an easy to use, well documented and tested library.
  • astroML – Machine Learning and Data Mining for Astronomy.
  • graphlab-create – A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.
  • BigML – A library that contacts external servers.
  • pattern – Web mining module for Python.
  • NuPIC – Numenta Platform for Intelligent Computing.
  • Pylearn2 – A Machine Learning library based on Theano.
  • keras – Modular neural network library based on Theano.
  • Lasagne – Lightweight library to build and train neural networks in Theano.
  • hebel – GPU-Accelerated Deep Learning Library in Python.
  • Chainer – Flexible neural network framework
  • gensim – Topic Modelling for Humans.
  • topik – Topic modelling toolkit
  • PyBrain – Another Python Machine Learning Library.
  • Brainstorm – Fast, flexible and fun neural networks. This is the successor of PyBrain.
  • Crab – A flexible, fast recommender engine.
  • python-recsys – A Python library for implementing a Recommender System.
  • thinking bayes – Book on Bayesian Analysis
  • Restricted Boltzmann Machines – Restricted Boltzmann Machines in Python. [DEEP LEARNING]
  • Bolt – Bolt Online Learning Toolbox
  • CoverTree – Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree
  • nilearn – Machine learning for NeuroImaging in Python
  • imbalanced-learn – Python module to perform under sampling and over sampling with various techniques.
  • Shogun – The Shogun Machine Learning Toolbox
  • Pyevolve – Genetic algorithm framework.
  • Caffe – A deep learning framework developed with cleanliness, readability, and speed in mind.
  • breze – Theano based library for deep and recurrent neural networks
  • pyhsmm – library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.
  • mrjob – A library that lets Python programs run on Hadoop.
  • SKLL – A wrapper around scikit-learn that makes it simpler to conduct experiments.
  • neurolab – https://github.com/zueve/neurolab
  • Spearmint – Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012.
  • Pebl – Python Environment for Bayesian Learning
  • Theano – Optimizing GPU-meta-programming code generating array oriented optimizing math compiler in Python
  • TensorFlow – Open source software library for numerical computation using data flow graphs
  • yahmm – Hidden Markov Models for Python, implemented in Cython for speed and efficiency.
  • python-timbl – A Python extension module wrapping the full TiMBL C++ programming interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.
  • deap – Evolutionary algorithm framework.
  • pydeep – Deep Learning In Python
  • mlxtend – A library consisting of useful tools for data science and machine learning tasks.
  • neon – Nervana’s high-performance Python-based Deep Learning framework [DEEP LEARNING]
  • Optunity – A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search.
  • Neural Networks and Deep Learning – Code samples for my book “Neural Networks and Deep Learning” [DEEP LEARNING]
  • Annoy – Approximate nearest neighbours implementation
  • skflow – Simplified interface for TensorFlow, mimicking Scikit Learn.
  • TPOT – Tool that automatically creates and optimizes machine learning pipelines using genetic programming. Consider it your personal data science assistant, automating a tedious part of machine learning.
  • pgmpy – A python library for working with Probabilistic Graphical Models.
  • DIGITS – The Deep Learning GPU Training System (DIGITS) is a web application for training deep learning models.
  • Orange – Open source data visualization and data analysis for novices and experts.
  • MXNet – Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more.
  • milk – Machine learning toolkit focused on supervised classification.
  • TFLearn – Deep learning library featuring a higher-level API for TensorFlow.
  • REP – an IPython-based environment for conducting data-driven research in a consistent and reproducible way. REP is not trying to substitute scikit-learn, but extends it and provides better user experience.

Data Analysis / Data Visualization

  • SciPy – A Python-based ecosystem of open-source software for mathematics, science, and engineering.
  • NumPy – A fundamental package for scientific computing with Python.
  • Numba – Python JIT (just in time) compiler to LLVM aimed at scientific Python by the developers of Cython and NumPy.
  • NetworkX – A high-productivity software for complex networks.
  • igraph – binding to igraph library – General purpose graph library
  • Pandas – A library providing high-performance, easy-to-use data structures and data analysis tools.
  • Open Mining – Business Intelligence (BI) in Python (Pandas web interface)
  • PyMC – Markov Chain Monte Carlo sampling toolkit.
  • zipline – A Pythonic algorithmic trading library.
  • PyDy – Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib.
  • SymPy – A Python library for symbolic mathematics.
  • statsmodels – Statistical modeling and econometrics in Python.
  • astropy – A community Python library for Astronomy.
  • matplotlib – A Python 2D plotting library.
  • bokeh – Interactive Web Plotting for Python.
  • plotly – Collaborative web plotting for Python and matplotlib.
  • vincent – A Python to Vega translator.
  • d3py – A plotting library for Python, based on D3.js.
  • ggplot – Same API as ggplot2 for R.
  • ggfortify – Unified interface to ggplot2 popular R packages.
  • Kartograph.py – Rendering beautiful SVG maps in Python.
  • pygal – A Python SVG Charts Creator.
  • PyQtGraph – A pure-python graphics and GUI library built on PyQt4 / PySide and NumPy.
  • pycascading
  • Petrel – Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python.
  • Blaze – NumPy and Pandas interface to Big Data.
  • emcee – The Python ensemble sampling toolkit for affine-invariant MCMC.
  • windML – A Python Framework for Wind Energy Analysis and Prediction
  • vispy – GPU-based high-performance interactive OpenGL 2D/3D data visualization library
  • cerebro2 – A web-based visualization and debugging platform for NuPIC.
  • NuPIC Studio – An all-in-one NuPIC Hierarchical Temporal Memory visualization and debugging super-tool!
  • SparklingPandas – Pandas on PySpark (POPS)
  • Seaborn – A python visualization library based on matplotlib
  • bqplot – An API for plotting in Jupyter (IPython)
  • pastalog – Simple, realtime visualization of neural network training performance.
  • caravel – A data exploration platform designed to be visual, intuitive, and interactive.
  • Dora – Tools for exploratory data analysis in Python.
  • Ruffus – Computation Pipeline library for python.
  • SOMPY – Self Organizing Map written in Python (Uses neural networks for data analysis).
  • HDBScan – implementation of the hdbscan algorithm in Python – used for clustering

Misc Scripts / iPython Notebooks / Codebases

Neural networks

  • Neural networks – NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences.

Kaggle Competition Source Code

Ruby

Natural Language Processing

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
  • Ruby Linguistics – Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.
  • Stemmer – Expose libstemmer_c to Ruby
  • Ruby Wordnet – This library is a Ruby interface to WordNet
  • Raspel – raspell is an interface binding for ruby
  • UEA Stemmer – Ruby port of UEALite Stemmer – a conservative stemmer for search and indexing
  • Twitter-text-rb – A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

Data Analysis / Data Visualization

  • rsruby – Ruby – R bridge
  • data-visualization-ruby – Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby
  • ruby-plot – gnuplot wrapper for ruby, especially for plotting roc curves into svg files
  • plot-rb – A plotting library in Ruby built on top of Vega and D3.
  • scruffy – A beautiful graphing toolkit for Ruby
  • SciRuby
  • Glean – A data management tool for humans
  • Bioruby
  • Arel

Misc

Rust

General-Purpose Machine Learning

  • deeplearn-rs – deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under the MIT license.
  • rustlearn – a machine learning framework featuring logistic regression, support vector machines, decision trees and random forests.
  • rusty-machine – a pure-rust machine learning library.
  • leaf – open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe. Available under the MIT license. [Deprecated]
  • RustNN – RustNN is a feedforward neural network library.

R

General-Purpose Machine Learning

  • ahaz – ahaz: Regularization for semiparametric additive hazards regression
  • arules – arules: Mining Association Rules and Frequent Itemsets
  • bigrf – bigrf: Big Random Forests: Classification and Regression Forests for Large Data Sets
  • bigRR – bigRR: Generalized Ridge Regression (with special advantage for p >> n cases)
  • bmrm – bmrm: Bundle Methods for Regularized Risk Minimization Package
  • Boruta – Boruta: A wrapper algorithm for all-relevant feature selection
  • bst – bst: Gradient Boosting
  • C50 – C50: C5.0 Decision Trees and Rule-Based Models
  • caret – Classification and Regression Training: Unified interface to ~150 ML algorithms in R.
  • caretEnsemble – caretEnsemble: Framework for fitting multiple caret models as well as creating ensembles of such models.
  • Clever Algorithms For Machine Learning
  • CORElearn – CORElearn: Classification, regression, feature evaluation and ordinal evaluation
  • CoxBoost – CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks
  • Cubist – Cubist: Rule- and Instance-Based Regression Modeling
  • e1071 – e1071: Misc Functions of the Department of Statistics (e1071), TU Wien
  • earth – earth: Multivariate Adaptive Regression Spline Models
  • elasticnet – elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA
  • ElemStatLearn – ElemStatLearn: Data sets, functions and examples from the book: “The Elements of Statistical Learning, Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani and Jerome Friedman
  • evtree – evtree: Evolutionary Learning of Globally Optimal Trees
  • forecast – forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models
  • forecastHybrid – forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the “forecast” package
  • fpc – fpc: Flexible procedures for clustering
  • frbs – frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks
  • GAMBoost – GAMBoost: Generalized linear and additive models by likelihood based boosting
  • gamboostLSS – gamboostLSS: Boosting Methods for GAMLSS
  • gbm – gbm: Generalized Boosted Regression Models
  • glmnet – glmnet: Lasso and elastic-net regularized generalized linear models
  • glmpath – glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model
  • GMMBoost – GMMBoost: Likelihood-based Boosting for Generalized mixed models
  • grplasso – grplasso: Fitting user specified models with Group Lasso penalty
  • grpreg – grpreg: Regularization paths for regression models with grouped covariates
  • h2o – A framework for fast, parallel, and distributed machine learning algorithms at scale — Deeplearning, Random forests, GBM, KMeans, PCA, GLM
  • hda – hda: Heteroscedastic Discriminant Analysis
  • Introduction to Statistical Learning
  • ipred – ipred: Improved Predictors
  • kernlab – kernlab: Kernel-based Machine Learning Lab
  • klaR – klaR: Classification and visualization
  • lars – lars: Least Angle Regression, Lasso and Forward Stagewise
  • lasso2 – lasso2: L1 constrained estimation aka ‘lasso’
  • LiblineaR – LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library
  • LogicReg – LogicReg: Logic Regression
  • Machine Learning For Hackers
  • maptree – maptree: Mapping, pruning, and graphing tree models
  • mboost – mboost: Model-Based Boosting
  • medley – medley: Blending regression models, using a greedy stepwise approach
  • mlr – mlr: Machine Learning in R
  • mvpart – mvpart: Multivariate partitioning
  • ncvreg – ncvreg: Regularization paths for SCAD- and MCP-penalized regression models
  • nnet – nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
  • oblique.tree – oblique.tree: Oblique Trees for Classification Data
  • pamr – pamr: Pam: prediction analysis for microarrays
  • party – party: A Laboratory for Recursive Partytioning
  • partykit – partykit: A Toolkit for Recursive Partytioning
  • penalized – penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model
  • penalizedLDA – penalizedLDA: Penalized classification using Fisher’s linear discriminant
  • penalizedSVM – penalizedSVM: Feature Selection SVM using penalty functions
  • quantregForest – quantregForest: Quantile Regression Forests
  • randomForest – randomForest: Breiman and Cutler’s random forests for classification and regression
  • randomForestSRC – randomForestSRC: Random Forests for Survival, Regression and Classification (RF-SRC)
  • rattle – rattle: Graphical user interface for data mining in R
  • rda – rda: Shrunken Centroids Regularized Discriminant Analysis
  • rdetools – rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces
  • REEMtree – REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data
  • relaxo – relaxo: Relaxed Lasso
  • rgenoud – rgenoud: R version of GENetic Optimization Using Derivatives
  • rgp – rgp: R genetic programming framework
  • Rmalschains – Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R
  • rminer – rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression
  • ROCR – ROCR: Visualizing the performance of scoring classifiers
  • RoughSets – RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories
  • rpart – rpart: Recursive Partitioning and Regression Trees
  • RPMM – RPMM: Recursively Partitioned Mixture Model
  • RSNNS – RSNNS: Neural Networks in R using the Stuttgart Neural Network Simulator (SNNS)
  • RWeka – RWeka: R/Weka interface
  • RXshrink – RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression
  • sda – sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection
  • SDDA – SDDA: Stepwise Diagonal Discriminant Analysis
  • SuperLearner and subsemble – Multi-algorithm ensemble learning packages.
  • svmpath – svmpath: the SVM Path algorithm
  • tgp – tgp: Bayesian treed Gaussian process models
  • tree – tree: Classification and regression trees
  • varSelRF – varSelRF: Variable selection using random forests
  • XGBoost.R – R binding for eXtreme Gradient Boosting (Tree) Library
  • Optunity – A library dedicated to automated hyperparameter optimization with a simple, lightweight API to facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.
  • igraph – binding to igraph library – General purpose graph library
  • MXNet – Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more.

Data Analysis / Data Visualization

  • ggplot2 – A data visualization package based on the grammar of graphics.

SAS

General-Purpose Machine Learning

  • Enterprise Miner – Data mining and machine learning that creates deployable models using a GUI or code.
  • Factory Miner – Automatically creates deployable machine learning models across numerous market or customer segments using a GUI.

Data Analysis / Data Visualization

  • SAS/STAT – For conducting advanced statistical analysis.
  • University Edition – FREE! Includes all SAS packages necessary for data analysis and visualization, and includes online SAS courses.

High Performance Machine Learning

Natural Language Processing

Demos and Scripts

  • ML_Tables – Concise cheat sheets containing machine learning best practices.
  • enlighten-apply – Example code and materials that illustrate applications of SAS machine learning techniques.
  • enlighten-integration – Example code and materials that illustrate techniques for integrating SAS with other analytics technologies in Java, PMML, Python and R.
  • enlighten-deep – Example code and materials that illustrate using neural networks with several hidden layers in SAS.
  • dm-flow – Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific data mining topics.

Scala

Natural Language Processing

  • ScalaNLP – ScalaNLP is a suite of machine learning and numerical computing libraries.
  • Breeze – Breeze is a numerical processing library for Scala.
  • Chalk – Chalk is a natural language processing library.
  • FACTORIE – FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.

Data Analysis / Data Visualization

  • MLlib in Apache Spark – Distributed machine learning library in Spark
  • Scalding – A Scala API for Cascading
  • Summing Bird – Streaming MapReduce with Scalding and Storm
  • Algebird – Abstract Algebra for Scala
  • xerial – Data management utilities for Scala
  • simmer – Reduce your data. A unix filter for algebird-powered aggregation.
  • PredictionIO – PredictionIO, a machine learning server for software developers and data engineers.
  • BIDMat – CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis.
  • Wolfe – Declarative Machine Learning
  • Flink – Open source platform for distributed stream and batch data processing.
  • Spark Notebook – Interactive and Reactive Data Science using Scala and Spark.

General-Purpose Machine Learning

  • Conjecture – Scalable Machine Learning in Scalding
  • brushfire – Distributed decision tree ensemble learning in Scala
  • ganitha – scalding powered machine learning
  • adam – A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.
  • bioscala – Bioinformatics for the Scala programming language
  • BIDMach – CPU and GPU-accelerated Machine Learning Library.
  • Figaro – a Scala library for constructing probabilistic models.
  • H2O Sparkling Water – H2O and Spark interoperability.
  • FlinkML in Apache Flink – Distributed machine learning library in Flink
  • DynaML – Scala Library/REPL for Machine Learning Research
  • Saul – Flexible Declarative Learning-Based Programming.

Swift

General-Purpose Machine Learning

  • Swift AI – Highly optimized artificial intelligence and machine learning library written in Swift.
  • BrainCore – The iOS and OS X neural network framework
  • swix – A bare bones library that includes a general matrix language and wraps some OpenCV for iOS development.
  • DeepLearningKit – An Open Source Deep Learning Framework for Apple’s iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems.
  • AIToolbox – A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians.
  • MLKit – A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression, Polynomial Regression, and Ridge Regression.

TensorFlow

General-Purpose Machine Learning

Credits

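Is Online Genetic Testing Reliable?

WeGene, a Shenzhen company founded less than a year ago, has launched a genetic sequencing service that offers 200+ test results for 1,299 yuan. Has high-end genetic sequencing really reached ordinary households? Or are you, like me, looking at internet-based genetic sequencing with wide-eyed surprise and doubt?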

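As personalized medicine and clinical cancer genomics research advance, a growing body of research indicates that cancer genomics and other personalized-medicine diagnostics can deliver good clinical results. Online marketing can raise public awareness of personalized medicine and genetic testing and widen the market for genetic testing services.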

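But not every online genetic testing service delivers good results. A study recently published in the Journal of the National Cancer Institute shows that as more testing services appear online, more risks follow. Online, genetic tests sell for anywhere from US$99 to US$13,000. Some websites provide a catalog, and users can order a test directly online and receive testing and genetic counseling services, bypassing conventional physician care and obtaining genetic interpretation directly.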

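“I think this is happening because, on the one hand, patients want genetic testing services and, on the other, genomic technologies and targeted therapies offer enormous benefits,” said Stacy Gray, the study's lead investigator. “Patients often come in for consultations with printouts of clinical lab reports or genetic test results, so we need to treat online testing services with caution.” Gray and her colleague Katherine Janeway are both clinicians at the Dana-Farber Cancer Institute. Gray added: “The growing use of genomic technologies to treat disease in cancer research looks very attractive.”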

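To better understand online personalized cancer testing, its diagnostic results and whether they can be used clinically, Gray, Janeway and colleagues combined internet searches with a literature review to select 55 websites offering personalized cancer medicine testing services. These included commercial companies, academic centers and research institutions, such as 23andMe, NeoGenomics and the Illumina Clinical Services Laboratory.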

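These companies and institutions offer online analysis and interpretation of somatic and inherited mutations, as well as other personalized medicine services. Of the 55 websites, 32 (58%) offered somatic mutation testing or analysis and 11 (20%) offered germline testing; 44% offered personalized cancer care services and 15% offered interpretation of test results. The cancers most commonly covered were breast, colon and lung cancer.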

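The researchers convened a panel of experts to judge whether the testing services offered by these websites were acceptable, using criteria set by the U.S. Centers for Disease Control and Prevention's Evaluation of Genomic Applications in Practice and Prevention (EGAPP) Working Group. If 90% of the panel members agreed that a testing or diagnostic service was unacceptable, it was classified as unacceptable.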
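The panel's decision rule can be written down in a few lines. This is a minimal sketch and not the study's own procedure: the function name `is_unacceptable` is hypothetical, and reading the rule as "at least 90% of the votes" is an assumption, since the text gives only the 90% figure.

```python
def is_unacceptable(votes, threshold_pct=90):
    """Apply the panel rule to one service.

    votes: list of booleans, True meaning the panelist judged the
    service unacceptable. threshold_pct=90 follows the article; treating
    it as "at least 90%" is an assumption of this sketch.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    unacceptable = sum(1 for v in votes if v)
    # Integer comparison avoids floating-point rounding at the boundary.
    return unacceptable * 100 >= threshold_pct * len(votes)


if __name__ == "__main__":
    print(is_unacceptable([True] * 9 + [False]))       # 9 of 10 panelists
    print(is_unacceptable([True] * 8 + [False] * 2))   # 8 of 10 panelists
```

With ten panelists, nine negative votes reach the threshold and eight do not.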

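“Unlike other reports, we did not look only at whether these tests have a scientific basis,” Gray said. “We focused on whether these online testing services have clinical utility, and asked the expert panel to assess whether the tests were acceptable.” Although some websites describe their genetic testing services and the technologies they use in detail, the panel found that for most websites at least one of the personalized cancer diagnostic or treatment offerings had no clinical utility.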

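Overall, some tests do have clinical value and should be used for patients, while others were judged by the researchers to have none. For example, among the tests offered by many websites, testing for KRAS and EGFR mutations is clearly valuable. Whole-exome sequencing and chemosensitivity testing, by contrast, have no established clinical value. Whole-exome sequencing is a promising technology, but it has not yet been shown to be usable for targeted cancer therapy. As for chemosensitivity tests, the American Society of Clinical Oncology has stated that they lack clinical value: the issue is not that guidelines are silent on them, but that they run counter to the testing guidelines. Even so, they are still sold online as personalized cancer treatment tests.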

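In addition, the researchers found that the vast majority of testing services had no evidence of clinical utility. That said, the criteria for judging clinical value keep changing: although these tests cannot yet be applied well in clinical treatment, there is evidence that they have an important bearing on patient prognosis.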

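In 2013, the U.S. Food and Drug Administration (FDA) ordered 23andMe to halt its testing service on the grounds that the company had not obtained authorization to provide it. Inaccurate test results can endanger public health and lead to unnecessary surgery. Although the researchers did not evaluate the validity of the claims these websites make when selling their tests, others have found that the claims made for online genetic testing services have little support. Providing accurate genetic testing and interpretation is therefore especially important.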

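Selling genetic testing services online may be just a new marketing approach. But policymakers need to consider updating regulation to fully protect public health while promoting medical innovation.

Gray said that, with the survey of online tests complete, the researchers are now further categorizing the genomic testing and personalized medicine services available online.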

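A Standard Playbook for Studying Functional Genes in Tumors

This is a short piece on how to start your research journey, written for those who have just joined a lab and do not yet know where to begin.

– by Lao Tan (老谈)

Lead-in: A microarray screen turned up a molecule; the boss thought of a molecule off the top of their head; a senior labmate handed one down to me; a paper highlighted an interesting molecule; a database has a promising one too... So here is the question: what next?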

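1. Feasibility analysis

First, check whether the molecule has research potential and whether it is linked to tumor initiation and progression. The simplest ways are, first, to search the literature and, second, to look the gene up in databases to see whether it is differentially expressed in tumors.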

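2. Novelty check

First, check whether the molecule has already been reported. If not, congratulations, you have cleared the first hurdle. If it has, analyze what has been reported and find the points that have not. For example, if the molecule has been reported to affect breast cancer proliferation, you could consider proliferation in other tumor types, or other phenotypes such as migration or drug resistance.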

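3. Prepare research tools

To study a gene's function, the most common approach is to alter its expression in cells and see whether the cellular phenotype changes. The usual manipulations are knockdown, overexpression and mutation, all of which can be achieved by constructing the corresponding vector (or buying a commercial clone) and transfecting the target cells.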

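4. Confirm the phenotype

At this step you need to determine which tumor function the gene affects. Many functional phenotypes can be studied; common ones are proliferation, metastasis and drug resistance, and somewhat harder ones include angiogenesis, energy metabolism and so on. Taking proliferation as an example: transfect siRNA, or a vector that produces siRNA, and use assays such as CCK-8 or MTT to test whether the cells' proliferation rate has changed.

If it has, congratulations, you can move on to the next level; if not, try metastasis, drug resistance or other directions, or switch to a different molecule!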

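5. Add supporting evidence

A single CCK-8 or MTT assay is not enough to establish the gene's effect on tumor proliferation; additional experiments are needed. Common ones are cell-cycle and apoptosis assays, and proliferation-related markers such as Ki67, p-Akt and PCNA are usually measured as well. Typically this needs to be done in at least two cell lines.

At this point, if all the data hold up, publishing a small SCI paper with an impact factor of about 1 is basically not a problem. Add testing of clinical samples, and an impact factor of 2 is not hard either. Of course, I trust your ambitions do not stop there; we are all aiming for CNS-level (Cell, Nature, Science) papers, or have at least fantasized about them! So let's continue!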

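6. In vivo models

In vivo and in vitro are two completely different environments; if you are after quality, animal experiments are a must. The models most commonly used in tumor research are nude mice and ordinary mice. Using powerful gene-editing techniques, researchers have built on these ordinary models to create transgenic animal models with specific phenotypes (for example the MMTV-PyMT transgenic mouse, which spontaneously develops breast cancer; for more, see Lao Tan's upcoming piece, Commonly Used Animal Models in Tumor Research).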

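7. Molecular mechanism

First, find the downstream molecules tied to the phenotype, for example proliferation-related signaling pathways such as Akt, p53 and MAPK. Look out for Lao Tan's piece after next (I have been praised so much by readers lately that I am too embarrassed to be modest, haha): Signaling Pathways and Phenotypes.

How do you find them?

The conservative (read: broke) approach is to test the few classic pathways one by one; the uninhibited (read: deep-pocketed) approach is to go straight to gene or protein microarrays for pathway screening.

Next, work out how the molecule regulates that signaling pathway, for example which molecules it regulates directly (yeast two-hybrid and immunoprecipitation can screen for proteins that bind it; ChIP and RIP can screen for DNA and RNA that bind the protein).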

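8. Clinical significance

Finally, validate the target molecule's expression in a large number of clinical samples (this can also be done back at step one. P.S.: the larger the sample size, the more meaningful the result, and the more significant the difference, the greater its value). You should also, ideally, test whether the downstream molecules regulated by the target molecule are likewise regulated by it in clinical tissue. Having come this far, success is within reach! Are you still afraid of hard work?!

Points 6, 7 and 8 may sound easy, but carrying them out burns money and, even more, brainpower. Of course, if you keep following 解螺旋, you will likely get twice the result for half the effort, and it will not be so hard!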

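Ribose-seq: Identifying RNA Incorporation Events in Genomic DNA

Ribonucleotides are the basic units of RNA. They become embedded in genomic DNA during DNA replication and repair, which in turn affects genome stability. Until now, however, it has not been possible to identify and locate these ribonucleotides embedded in DNA.

To this end, scientists at the Georgia Institute of Technology and the University of Colorado developed a new sequencing technique, Ribose-seq. It can identify and analyze ribonucleotides incorporated into genomic DNA and is applicable to many organisms, including humans. The work was published on January 26 in Nature Methods.

Using the technique, the researchers produced a complete map of ribonucleotides in the nuclear and mitochondrial DNA of Saccharomyces cerevisiae and identified “hotspots” of ribonucleotide incorporation. The study shows that ribonucleotide incorporation is widespread but not random.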

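“Ribonucleotides are the most abundant non-standard nucleotides in DNA, but until now it has not been possible to determine their location and identity,” said Francesca Storici, an associate professor at Georgia Tech, who led the study together with Jay Hesselberth, an assistant professor at the University of Colorado. “Ribonucleotide incorporation alters the structure and function of DNA.”

The hydroxyl group (-OH) in a ribonucleotide can distort DNA and create sensitive sites. Notably, the reaction between this -OH and an alkaline solution makes the DNA easier to cleave. Ribose-seq exploits this reaction to detect ribonucleotide incorporation events.

The researchers first cleave the DNA at ribonucleotides and then build a DNA library on that basis; the DNA sequences in the library contain the ribonucleotide incorporation site and its upstream sequence. They then sequence the library at high throughput and align the reads to a reference genome, ultimately obtaining a genome-wide map of rNMP incorporation events.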
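The last step, turning aligned reads into a map of incorporation sites and hotspots, can be sketched as a simple counting pass. This is an illustration under stated assumptions, not the published pipeline: alignments are given as hypothetical `(chrom, start, end, strand)` tuples with 0-based, half-open coordinates, and the rNMP is assumed to sit at the 3' end of each aligned fragment (because the captured fragment holds the site plus its upstream sequence). The helper names and the `min_reads` cutoff are likewise made up for the example.

```python
from collections import Counter


def rnmp_site(start, end, strand):
    """Return the assumed rNMP coordinate for one alignment.

    Assumption of this sketch: the ribonucleotide is the last base of
    the fragment in the fragment's own orientation.
    """
    if strand == "+":
        return end - 1
    if strand == "-":
        return start
    raise ValueError("strand must be '+' or '-'")


def count_sites(alignments):
    """Count reads supporting each (chrom, position, strand) site."""
    counts = Counter()
    for chrom, start, end, strand in alignments:
        counts[(chrom, rnmp_site(start, end, strand), strand)] += 1
    return counts


def hotspots(counts, min_reads=2):
    """Sites supported by at least min_reads reads (hypothetical cutoff)."""
    return sorted(site for site, n in counts.items() if n >= min_reads)


if __name__ == "__main__":
    toy_alignments = [
        ("chrM", 100, 150, "+"),
        ("chrM", 120, 150, "+"),
        ("chrM", 150, 200, "-"),
        ("chrI", 10, 60, "+"),
    ]
    site_counts = count_sites(toy_alignments)
    print(hotspots(site_counts))
```

In practice the alignments would come from a read aligner's output and the hotspot call would need a statistical background model; the fixed read-count cutoff here only illustrates the idea.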

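“Ribose-seq specifically and directly captures ribonucleotides embedded in DNA,” Storici noted. “The technique works on any genomic DNA (from nuclear genomes and plasmid DNA to mitochondrial DNA) and requires no standardization. Ribose-seq can also analyze rNMPs when DNA suffers breaks and abasic sites under environmental stress.”

The hydroxyl group in ribonucleotides is the key to Ribose-seq. “The -OH is unique to ribonucleotides,” said Kyung Duk Koh, the paper's first author.

The researchers validated the method in Saccharomyces cerevisiae. “There are biases both in where ribonucleotides are incorporated and in their composition,” Koh said. “We found some hotspots of ribonucleotide incorporation in the genome.” Building on this, unstable genomic regions can be identified and their effects on DNA properties and activity better understood.

Next, the researchers will apply Ribose-seq to other DNA. “The technique can be used on any cell type of any organism, as long as genomic DNA can be extracted,” Koh said.

Beyond DNA repair and replication, damage caused by drugs, environmental stress and other factors can also lead to ribonucleotides being incorporated into DNA. Ribose-seq can help researchers study the effects of these processes.

“Ribose-seq lets us better understand how ribonucleotides affect DNA structure and function,” Storici said. “Identifying characteristic ribonucleotide incorporation could reveal new biomarkers for human diseases such as cancer and degenerative disorders.”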

参考文献:

Ribose-seq: global mapping of ribonucleotides embedded in genomic DNA

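Biotechniques: How can the 1% base-calling error rate of next-generation sequencing be reduced?

Why do genome-wide association studies of complex diseases so often fail?

Errors introduced by next-generation sequencing make it hard to detect rare variants, yet these rare variants play an important role in cancer and other diseases.

In a recent Biotechniques news feature, Janelle Weaver reported on the progress scientists have made in characterizing these errors and in developing strategies to improve accuracy.

After sequencing and analyzing the exomes of 2,400 people, researchers in the Exome Sequencing Project of the US National Heart, Lung, and Blood Institute (NHLBI) concluded that most single-nucleotide variants are rare, occurring in fewer than 0.5% of the sampled population.

This finding explains why genome-wide association studies (GWAS), which look for associations between common genetic variants and specific disease phenotypes, usually fail for complex diseases.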

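Jan Vijg

Jan Vijg of the Albert Einstein College of Medicine studies the relationship between genome damage and aging. He said: "Rare mutations are turning out to be more and more important. We have tended to look for disease risk variants among common variants, but it is now clear that the rare variants, taken together, give rise to many disease phenotypes, so we need to study them."

Although next-generation DNA sequencing holds great promise for detecting disease-associated mutations and advancing personalized medicine, our ability to detect rare variants is limited by errors introduced during sample preparation, sequencing, and analysis. As a result, roughly 1% of bases are called incorrectly.

This error rate is acceptable for some applications, but it has become a major obstacle for cancer researchers. Researchers are therefore developing new techniques to accurately pick out rare variants from the sea of the genome.

Current work centers on three points.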

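NO.1 Duplex Sequencing: lowering the error rate

At the University of Washington, Larry Loeb studies rare genetic variation in cancer. Because earlier sequencing methods were not accurate enough, however, his lab could only study mutations with a frequency above 10%. "We needed a more accurate method for studying variants that may not be present in every tumor cell; they may be subclonal or random," he said.

Members of the Loeb lab: Lawrence Loeb, Scott Kennedy, Michael Schmitt, and Jesse Salk (Source: University of Washington)

Loeb and his team described such a method, called Duplex Sequencing, in a 2012 paper in PNAS. By independently tagging and sequencing each of the two strands of a DNA duplex, the method achieves a theoretical background error rate of less than one artifactual mutation per billion nucleotides sequenced. This makes it highly sensitive for detecting rare DNA variants and for single-molecule counting, which allows absolute DNA or RNA copy numbers to be determined precisely.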

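Duplex Sequencing

In Duplex Sequencing, both strands of a double-stranded DNA fragment are tagged with a random, complementary double-stranded nucleotide sequence. A single-stranded random nucleotide sequence is first introduced into one adapter strand; a DNA polymerase then extends it to produce the complementary double-stranded tag, so that the double-stranded tag sequence is incorporated into a standard Illumina sequencing adapter. The tagged adapters are then ligated to sheared DNA, and the individually tagged strands are PCR-amplified and sequenced with paired-end reads.

By comparing the sequences obtained from each of the two strands of a duplex, Loeb and colleagues can distinguish sequencing errors from true mutations. Because the two strands of a DNA duplex are complementary, a true mutation is found at the same position on both strands. By contrast, a PCR or sequencing error produces a change on only one strand and can therefore be discounted as a technical error.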
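
The strand-comparison logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the Loeb lab's actual pipeline: the function names, the `min_fraction` threshold, and the toy reads are all hypothetical, and the sketch assumes reads have already been grouped by tag and that bottom-strand reads have been reverse-complemented into the same orientation as the top strand.

```python
from collections import Counter


def strand_consensus(reads, min_fraction=0.9):
    """Majority base at each position among reads from one strand of one
    tagged molecule; 'N' where no base reaches min_fraction (assumed cutoff)."""
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


def duplex_consensus(top_reads, bottom_reads, min_fraction=0.9):
    """Keep a base only where both single-strand consensuses agree.
    A change seen on one strand only is masked as 'N' (likely artifact)."""
    top = strand_consensus(top_reads, min_fraction)
    bottom = strand_consensus(bottom_reads, min_fraction)
    return "".join(t if t == b and t != "N" else "N" for t, b in zip(top, bottom))


def call_variants(duplex, reference):
    """Positions where the duplex consensus confidently differs from the reference."""
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, duplex))
        if alt != "N" and alt != ref
    ]


# Toy data (hypothetical): position 4 carries a real A>T change on both strands;
# position 1 carries a C>T change on the top strand only, e.g. an early PCR error.
reference = "ACGTACGT"
top_reads = ["ATGTTCGT", "ATGTTCGT", "ATGTTCGT"]
bottom_reads = ["ACGTTCGT", "ACGTTCGT", "ACGTTCGT"]

duplex = duplex_consensus(top_reads, bottom_reads)
variants = call_variants(duplex, reference)
print(duplex, variants)
```

In the toy data, the change present on both strands survives as a variant call, while the change present on only one strand is masked rather than reported.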

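Loeb said: "The best of the other methods produce about one error in every thousand nucleotides. If you want to detect genetic abnormalities present in every cell of the body, that is good enough. But if you want to detect rare variants, or the distribution of mutations within a tumor, or viral populations, that error rate is far too high."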
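
A back-of-the-envelope calculation shows why an error rate of one in a thousand swamps rare variants. The two error rates come from the text; the sequencing depth and the 0.1% variant fraction are hypothetical values chosen only for illustration, and the sketch ignores that errors are spread across three alternative bases.

```python
def expected_errors(bases_read: int, error_rate: float) -> float:
    """Expected number of miscalled bases among bases_read base calls."""
    return bases_read * error_rate


CONVENTIONAL_ERROR_RATE = 1e-3  # about one error per thousand nucleotides (Loeb's figure)
DUPLEX_ERROR_RATE = 1e-9        # theoretical background: under one per billion nucleotides

# Hypothetical scenario: one genomic position covered by 10,000 reads,
# with a true variant present in 0.1% of the DNA molecules.
depth = 10_000
variant_fraction = 1e-3

true_variant_reads = depth * variant_fraction
conventional_noise = expected_errors(depth, CONVENTIONAL_ERROR_RATE)
duplex_noise = expected_errors(depth, DUPLEX_ERROR_RATE)

print(f"reads carrying the true variant:       {true_variant_reads:.0f}")
print(f"expected erroneous calls, conventional: {conventional_noise:.0f}")
print(f"expected erroneous calls, duplex:       {duplex_noise:.0e}")
```

Under these assumptions the expected number of erroneous calls at the position is about the same as the number of reads carrying the real variant, so the two cannot be told apart; at the duplex background rate the expected noise is negligible.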

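NO.2 Adding metal chelators to reduce DNA oxidation artifacts: improving accuracy

Scientists are well aware of the errors introduced by PCR and next-generation sequencing, but errors arising during DNA extraction and sample preparation have received little attention. A team led by Gad Getz of the Broad Institute of MIT and Harvard tackled this problem in a study published in Nucleic Acids Research in 2013, which identified a new source of artifactual mutations arising during sample preparation.

According to the study, an excess of C>A/G>T transversions at low allelic fractions in ultra-deep-coverage targeted capture sequencing data results from DNA oxidation during acoustic shearing of samples that contain reactive contaminants from the extraction process. Adding metal chelators to the shearing buffer reduces these oxidation artifacts, and a post-processing filter can screen out oxidation-induced errors in the sequencing data. The results show that changes to laboratory procedures, combined with informatics tools, can help researchers suppress the effects of such artifacts.

Nadav Ahituv of the University of California, San Francisco, who studies the role of gene regulatory sequences in human biology and disease and was not involved in the study, noted: "Everyone knows that any sequencing technology has errors, so I don't find this surprising. The strength of this study is that they went after the cause and came up with good computational tools to reduce the problem."

According to Jussi Taipale, a cancer systems biologist at the Karolinska Institute, there are other potential ways to improve sequencing accuracy besides the experimental and informatics approaches described in these recent studies. Taipale, who was not involved in the studies, noted: "Accuracy is always limited by the error rate of the polymerase, because you have to use one. If we could develop an enzyme with a lower error rate, and if we could deal with all the chemical sources of mutation, that would certainly help further."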

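NO.3 Clinical impact: Duplex Sequencing can reveal rare mutations that confer drug resistance

Researchers who work on rare mutations, like Vijg, may not have to wait long to adopt these methods. Duplex Sequencing, for example, can be adapted to a variety of sequencing platforms: adapters containing the double-stranded tag sequence can replace the standard sequencing adapters without significantly changing the normal sample-preparation workflow for Illumina sequencers. "We could certainly implement it quickly in our current workflow," Vijg said.

Perhaps most importantly, Duplex Sequencing can reveal rare mutations that confer drug resistance. "If we already know that somewhere in the tumor there is a particular variant that gives it a chance to escape a particular drug, then we can try a different drug. That could have a direct impact on treatment," Vijg said. In addition, future advances in single-cell sequencing may further help researchers pin down these kinds of mutations.

Whether Duplex Sequencing can be applied to many types of mutation remains to be determined, however. "In practice they have applied it mainly to small mutations, point mutations. Whether it can be done for large changes, such as large deletions, translocations, or copy-number variation, is not clear to me."

Finally, Duplex Sequencing is unlikely to replace standard methods. One reason is that it is too expensive for whole-exome sequencing. Loeb said: "It is really useful for sequencing heterogeneous mixtures of cells, or for biological questions that require extreme accuracy. So in a sense, Duplex Sequencing will probably be limited to things like cancer research, viral populations, ancient DNA, and forensics."

Ahituv believes the various sequencing methods will be used side by side. "The most important message of these two papers is that if we want to use next-generation sequencing to look for rare genetic variants, we have to be very careful," he said.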

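Apple CEO Cook comes out! Is it down to genes?

Apple's CEO has publicly acknowledged that he is gay. Fans quipped that the news finally cleared up a puzzle: so that is why your iPhone 6 bends...

Apple CEO Tim Cook said he is proud to be gay. He added that if hearing that the CEO of Apple is gay can bring comfort to others, it is worth giving up some of his own privacy. No wonder, the jokes continued, about the rainbow-colored 5C, the flower wallpaper on the 6, and the ever more bendable iPhone: the hints had been there all along.

So, is homosexuality determined by genes?

Homosexuality has existed since ancient times, but opinions on its causes differ widely. People have tried to explain it from the perspectives of biology, psychology, and social environment, yet so far no theory is well enough supported or persuasive enough to settle the question. The current general view is that homosexuality results from a combination of innate and acquired factors, with a growing body of evidence supporting the importance of genes.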

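The scientist who reported a "gay gene" is himself gay

The American geneticist Dean Hamer was the first to report a "gay gene": studying the X chromosomes of pairs of gay brothers, he found that a region at the tip of the chromosome, Xq28, was linked to homosexuality.
After the paper was published in Science, Hamer became a controversial figure, and when people learned that the author was himself gay, the skepticism grew louder. Several later studies nevertheless supported Hamer's conclusion, and candidate regions were also found on three other chromosomes.

A study of 400 gay people points to chromosome 8

Early this year, however, a large study at Northwestern University of DNA from 400 gay people found that Xq28 cannot accurately predict homosexuality. The study also found a linked region on chromosome 8, but that region likewise cannot predict it precisely.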

 

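Genetically modified fruit flies show same-sex behavior

Molecular biology has neither confirmed nor ruled out the existence of a "gay gene", but observations from the animal kingdom lean toward it: same-sex behavior is widespread among animals, and genetically modifying fruit flies can make them display same-sex courtship, which provides animal-experiment evidence for a genetic contribution.

Korean scientists change the sexual preference of female mice

Korean scientists reported in 2010 that deleting a particular gene, FucM, in female mice at the embryonic stage changes their sexual preference: the mice reject advances from males and try to mate with other females. It turns out that FucM affects estrogen levels, which in turn affects the brain.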

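If gay people do not have children, how are the genes passed on?

Gay people are said to make up between 5% and 15% of the population. If a "gay gene" really exists, that seems hard to explain in evolutionary terms: gay people would leave no offspring, so the genes would not be passed on, let alone reach such a frequency. Genes that work against reproduction should be weeded out by evolution. So why has a genetic predisposition to homosexuality persisted? We still have no answer, but evolutionary scholars have proposed several promising hypotheses:

Kin selection hypothesis

Scientists conjecture that genes producing altruism helped genetically related kin, giving the copies of those genes carried by the relatives a hereditary advantage, and so altruism persisted. The same reasoning may apply to homosexuality: gay individuals do not invest time and energy in their own reproduction, so perhaps they can help relatives raise offspring, ultimately benefiting the genes for a homosexual predisposition that those children may carry.
One study surveyed gay men in Samoa in the South Pacific. Samoa is a more traditional society, where such men are known as "Fa'afafine". They do not have children and are fully accepted by society as a whole, and especially by their own blood relatives. They devote a great deal of energy to their nieces and nephews, children with whom they share on average 25% of their genes.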

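Sexually antagonistic selection hypothesis

Perhaps genes that reduce reproductive fitness in one sex, for example in gay men, increase fitness in women. Some experts regard these as "man-loving genes" whose main effect is to make women who carry them sexually mature earlier, so that they have more children and gain an evolutionary advantage. The cost is that men who carry the genes become gay. There is evidence for this theory: a study in Italy found that the female relatives of gay men have 1.3 times as many children as other women. That is a huge selective advantage, and its cost is that some male relatives are gay. On balance the benefit outweighs the cost, so the genes are favored by selection and passed down the generations on the X chromosome.
By the same reasoning, men should have "woman-loving genes", which confer an evolutionary advantage in men, who have more offspring, but produce lesbianism in women. Here too the benefit outweighs the cost, so the genes are inherited. These "woman-loving genes" cannot be inherited through the sex chromosomes and would lie on other chromosomes, which would explain why lesbians are fewer in number than gay men and include a larger share of bisexuals.

Social prestige hypothesis

There is anthropological evidence that in pre-industrial societies gay men were more likely to become priests or shamans. Their heterosexual relatives thereby gained higher social standing and with it a reproductive advantage, so that any shared genes for a homosexual predisposition were perpetuated. It is a very attractive idea, but it lacks empirical support.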

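Group selection hypothesis

Most biologists hold that natural selection acts at the level of individuals and their genes, not groups. But humans may be an exception; perhaps groups that included gay individuals fared better than groups made up entirely of heterosexuals. Recently, the anthropologist Sarah B. Hrdy and others have pointed out that for most of human evolutionary history, raising offspring was not the exclusive job of the parents (still less of the mother alone). Our ancestors practiced a great deal of allomothering, in which people other than a child's parents, especially other blood relatives, took part in raising the young. It makes sense that Homo sapiens developed such a system, because among all primates human newborns are the most helpless and demand the most adult investment. If enough of the childcare helpers in a population were gay, the whole group would benefit greatly.
On the other hand, even if gay individuals among our ancestors did not necessarily take part in cooperative child-rearing, the fact that they had fewer children (or none at all) would itself have freed up more resources for their heterosexual relatives. Other researchers have proposed group-level models that focus on social interaction rather than resource use: homosexuality may be associated with greater sociality and social cooperation, and it may also reduce violent competition for mates.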

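Balanced polymorphism hypothesis

Perhaps the genetic predisposition to homosexuality, when acting together with one or more particular genes, yields a compensating benefit for some unknown reason. A well-known example of this kind is sickle-cell disease, whose causative gene helps protect against malaria. Although no gene has yet been identified as determining sexual orientation, the possibility of such a balanced polymorphism cannot be ruled out.

Nonadaptive byproduct

Homosexual behavior may be neither adaptive nor maladaptive; it may simply be nonadaptive. In other words, it may not have been selected for at all but retained as a byproduct of some advantageous trait. Such a trait might be the desire to form pair bonds, or to seek emotional or physical satisfaction, and so on. Why, then, does such a tendency exist, and why are close relationships between people pleasurable? The answer is most likely that, over the course of evolution, long-term pair bonds were most favorable to an individual's reproductive success.

What exists is reasonable: the genes will not be lost

The cause of homosexuality remains unsettled, but two things are certain. First, in animals and humans alike, sexual behavior cannot and should not have reproduction as its sole purpose. Second, the genes involved will not be lost: they have come a long way through biological evolution, and if they were going to be eliminated, they would be gone already, wouldn't they? There is no need to single gay people out for "special treatment" just because they are a minority. What exists is reasonable, and on this lonely planet everyone is the same.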


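How circulating tumor cells make it possible to spot cancer in blood

Introduction

He watched his brother die of cancer, with no drug able to save him. Now one of the world's most authoritative cancer experts says it is time for Plan B: a universal screening tool has been found that can pick up the molecular traces of cancer cells during the annual checkup of a patient with no symptoms.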

 

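Finding trace amounts of tumor DNA in blood

The answer Bert Vogelstein needed, and dreaded, lay in a blood sample.

Vogelstein is one of the world's most renowned scientists. In the 1980s he and his colleagues at Johns Hopkins University worked out how DNA accumulates a series of mutations over decades until a cell turns cancerous, an achievement seen as taking one of cancer's strongholds. Vogelstein made a major contribution to establishing that damaged DNA is the culprit behind cancer.

Now imagine being able to see those mutations, to see cancer itself, in the blood. Nearly every type of cancer sheds DNA into the bloodstream, and Vogelstein's laboratory at Johns Hopkins has invented a technique called the "liquid biopsy" that can find this cancer genetic material.

The technique uses instruments to rapidly sequence the DNA in a blood sample, allowing researchers to find even minute amounts of tumor DNA. Working with doctors at Baltimore's largest cancer treatment center, the Hopkins scientists have studied more than a thousand blood samples. They believe the liquid biopsy can detect cancer before symptoms appear.

One blood sample was special. It came from Vogelstein's younger brother, a plastic surgeon one year his junior. He had skin cancer, and it had spread. He was responding to a new drug at the time, but the treatment made his whole body swell, so it was hard to tell from X-rays or CT scans whether the cancer was disappearing. Vogelstein therefore turned to the new technique: if the tumor DNA had vanished from the blood, they could celebrate; if it was still there, he might have to urge his brother to switch to another drug as a last resort.

"We were trying to guide the therapy. Whatever happened, there was still a glimmer of hope," Vogelstein said, his voice choking. He did not tell us what happened next.

The obituary of Barry Vogelstein, who was born in Baltimore, was published on July 3, 2013.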

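Prevention is more effective than treatment

We are not winning the fight against cancer, and the death of Vogelstein's brother shows why. Many cancers are discovered only when they have already become incurable. The world spends 91 billion US dollars a year on cancer drugs, but for most of these patients the drugs come too late. The newest treatments are outrageously expensive, costing 10,000 dollars a month, yet often extend life by only a few weeks. Drug companies develop and test far more drugs for late-stage cancer than for any other type of drug.
"Ordinary people and scientists alike are fixated on one idea: curing late-stage cancer," Vogelstein said. "That is the Plan A society has generally adopted, and I don't think it is the answer." There are other ways to reduce cancer deaths: wearing sunscreen, not smoking, and screening to catch cancer as early as possible. To Vogelstein these preventive measures amount to "Plan B", because cancer prevention has not been given enough attention or funding.

Yet prevention works far better than any drug. In the United States, the death rate from colon cancer is 40% lower than it was in 1975, largely because the disease is now mostly caught during colonoscopy. Likewise, melanoma skin cancer can be treated surgically if it is found early. "We think Plan B should become Plan A," Vogelstein said.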

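Circulating tumor cells make it all possible

The new blood test could make all this possible. The Hopkins researchers are saying for the first time that they have found a universal screening tool that can pick up the molecular traces of cancer cells during the annual checkup of a patient with no symptoms. "I think we have solved the problem of early detection," said Victor Velculescu, a Hopkins researcher whose lab is one floor away from Vogelstein's.
Routine cancer screening will be a challenge for medicine. One difficulty is that when cancer DNA is detected in the body, the physician may not know where the tumor is, how dangerous it is, or even whether it is worth treating. "We have to be careful," said Daniel Haber, director of the Massachusetts General Hospital Cancer Center. He believes DNA blood tests are "not yet mature" and that a great deal of research is still needed to prove their usefulness. "There are many thorny problems still to be solved," he said.

Despite such doubts, the technology is attracting more and more attention. Tony Dickherber, who heads the Innovative Molecular Analysis Technologies program at the US National Cancer Institute, said that three years ago testing blood for tumor DNA "was nothing more than a fringe technology". Now, from California to London, many laboratories and companies are working to improve blood tests and are actively seeking new data in support of the approach. "People are starting to believe that Vogelstein is right, that this may be the best route to early cancer diagnosis," he said. "It may be more powerful than other existing screening technologies, and the range of cancers it can screen for is incredible."

Doctors from Hopkins and 23 other institutions published their findings in February. They studied 846 patients with 15 different types of cancer. Tumor DNA was found in the blood of more than 80% of patients with advanced cancer that had spread, and of about 47% of patients with early-stage cancer that had not spread. Tumor DNA was detected in the blood of every patient with advanced colon cancer.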

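At first glance these results may not seem impressive, since the test would often miss a cancer. But according to Velculescu, the benefit of the blood test is that it is "extremely accurate": if tumor DNA is found in your blood, you have cancer. That gives the DNA test an advantage over current screening for prostate and breast cancer, which frequently produces false positives. "Having circulating DNA in the blood is normal; having circulating DNA that matches a tumor is not," said Stefanie Jeffrey, head of surgical oncology research at Stanford University.

To Vogelstein, the blood test means that more than half of all cancers could be found early and potentially cured by surgery. "If there were a drug that cured half of all cancers, you would have a ticker-tape parade in New York," he said.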

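Early detection of circulating tumor cells

President Nixon signed the "war on cancer" into law in 1971, when Vogelstein was in medical school. Drugs failed to bring cancer death rates down, and years of research seemed to bear little fruit. Today, though, we know what causes cancer. The research Vogelstein carried out with his colleague Kenneth Kinzler in the 1980s established that gene mutations are the culprit. Scientists have since found more than 150 cancer-causing genes. The genetic landscape of cancer is very complex, but all of these DNA mutations lead to the same outcome: normal cells that should have died keep dividing and multiplying. That imbalance in cell growth is cancer.

For drug companies, this has meant spending billions of dollars on new drugs for late-stage cancer. For Vogelstein, the discovery that DNA mutations cause cancer means something else as well: it should be possible to identify the disease-causing mutations before the disease is diagnosed. There is a consensus in oncology that the earlier a cancer is found, the better the chance of survival.

Colon cancer is the cancer Vogelstein has studied most thoroughly. It begins with a single mutation in a gene called APC. But it takes 30 years on average for a cell with that mutation to acquire the DNA mutations that let it spread and kill. About 600,000 people die of colon cancer each year. "Virtually all of them die only because their cancer was not found during the first 27 years the tumor existed," Vogelstein said. "There is a long window in that process during which we could treat it."

The problem is that, apart from a blood test, there has been no easy way to find these mutations. Vogelstein began working on early cancer detection in the 1990s, at first using the conventional approach of the time, looking for tumor DNA in urine and stool. He believes prevention and screening are still undervalued; even now he is in the "absolute minority" among researchers. By his estimate, 100 times as much is spent on developing drugs as on prevention and screening.

This helps explain why Vogelstein, for all his renown, often comes across as aggrieved. The Hopkins team, which includes several other prominent scholars, has a habit of publishing ideas that upend mainstream scientific thinking. By lab tradition, young scientists who want to work there must wear a Burger King crown when they first present their research.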

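Luis Diaz's idea

The lab's work on blood tests is led by Luis Diaz, an oncologist and a protege of Vogelstein's. He came up with the idea of using a blood test to look for cancer DNA in 2005, while studying whether flesh-eating bacteria could be used to destroy tumors. The research involved transplanting human cancers into mice, and Diaz realized he needed a way to monitor the tumors inside the mice without killing them. He and his colleagues decided to use a blood test. They soon found that levels of human DNA rose or fell sharply depending on whether the treatment failed or succeeded. If DNA levels could be monitored in mice, could the same be done in people?

The idea was not new. As early as 1948 it was suggested that free, circulating DNA exists in human veins and arteries, usually as waste from dead cells. But tumors also shed DNA into the blood. In people dying of cancer, tumor-derived DNA can make up as much as 87% of this DNA, but usually the amount is so small that it is hard to detect.

When Diaz began to explore the question, none of this had taken shape as theory; it was only a vague possibility. To develop the liquid biopsy, the Hopkins scientists first had to invent a way to tell tumor DNA apart from the large excess of normal DNA. The blood samples came from colon cancer patients Diaz was treating in Baltimore at the time, and initially the researchers tracked only four cancer genes. They found that the tumor DNA in the blood disappeared quickly after these patients had surgery or drug treatment, sometimes vanishing without a trace within a day. Tests on healthy controls never came back positive. "We realized that this test could ask and answer the question, 'Do I have cancer?'" Diaz said.

The Hopkins scientists are convinced that the test is far more sensitive than any tool doctors use today, at least for cancers still too small to be picked up by imaging. By Vogelstein's estimate, a tumor releases detectable DNA once it contains at least 10 million cells and is about the size of the head of a pin. By comparison, a tumor has to be 100 times that size, containing at least a billion cells, before it shows up on an MRI scan.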
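
The size comparison above is simple arithmetic, sketched below. The two cell counts are the figures quoted in the text; the conversion to a diameter ratio rests on an assumption not made in the article, namely that tumors are roughly spherical with a constant cell density.

```python
# Cell counts quoted in the text
detectable_by_blood_test = 10_000_000   # about 10 million cells
detectable_by_mri = 1_000_000_000       # about 1 billion cells

cell_ratio = detectable_by_mri / detectable_by_blood_test

# Assumption (not from the article): spherical tumor, constant cell density,
# so volume scales with cell count and diameter with its cube root.
diameter_ratio = cell_ratio ** (1 / 3)

print(f"cell-count ratio: {cell_ratio:.0f}x")
print(f"diameter ratio:   {diameter_ratio:.2f}x")
```

Under that assumption, a tumor with 100 times as many cells is only a little under five times as wide, which gives a sense of how much earlier in a tumor's growth a DNA-based test might respond.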

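Hopkins doctors have already begun using the DNA test to determine whether malignant cells remain in a patient's body after a tumor is removed. Working with Peter Gibbs, an Australian oncologist, they analyzed blood samples from 250 patients who had surgery for early-stage colon cancer. Most such patients are cured, but in about 30% not all of the tumor has been removed and the cancer may return. The problem is that doctors do not know which patients will relapse. "The surgeon tells them, 'Don't worry, the tumor is out,'" Diaz said. "That frustrates me, because I have to tell them, 'We really don't know whether you are cured.'" These cancer survivors live in anxiety, not knowing when the cancer might come back, possibly in a more dangerous form, and that state can last for years.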

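Six weeks after surgery, the Australian patients had their blood tested for tumor DNA. The researchers say that so far they have correctly identified about half of the patients whose cancer later returned. According to Vogelstein, such patients could go on to receive chemotherapy, which might cure at least a third of them. But the test's limitation is also obvious: half of the patients who relapsed were not picked up.

Diaz said this may be because the remaining cancer cells do not release enough DNA. "It may be that we have reached a biological limit," he said. However, cancer DNA rises to detectable levels over time, so regular testing can find it. Although the Hopkins test is still experimental, Diaz said he is confident enough to tell some patients that they are not yet cured and others that they very probably are. "After six to eight weeks, we can tell them whether they have been cured," he said. "That is very gratifying."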

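Goal: become a routine cancer screening tool

Vogelstein said his ultimate goal is to make the blood test a routine cancer screening tool, and the Hopkins researchers are optimistic about getting there. They are using the DNA in blood samples to sequence a person's entire genome rather than tracking only a few key cancer genes. This lets them measure how often genetic material is misplaced or scrambled. Large numbers of DNA rearrangements are a molecular side effect that occurs only in the chromosomes of cancer cells, and so they are a signal of cancer. But sequencing a complete genome is expensive. "If someone has cancer, they don't mind paying 5,000 dollars for a genetic test. But they won't pay 1,000 dollars for it at their annual checkup," Vogelstein said. "Our goal is to bring the cost of the test down to where it can be used in routine screening."
That will take time. DNA sequencing costs have fallen sharply, but reaching 100 dollars, a price low enough for routine screening, may take 10 years. In the meantime, Hopkins is running several studies in people who are at elevated risk of cancer to see whether early cancers can be found in otherwise healthy people. One of the studies involves 800 people at risk. In these particular cases the patients have cysts on the pancreas, some of which will turn into cancer and some of which will not.

Pancreatic cancer is one of the cancers best suited to early detection by blood test. It is not very common, yet it is the fourth deadliest cancer in the United States, with a current cure rate of only 4%. If it is caught early, before it spreads, the survival rate can rise to about 25%. (Apple co-founder Steve Jobs died at 56 of a different type of pancreatic cancer, a neuroendocrine tumor.)

Making a DNA test available to everyone, however, would be a huge leap. Haber, the Massachusetts General Hospital oncologist, believes that for now the technology may be able to tell a doctor whether a patient has cancer. But unlike an imaging scan or a biopsy, it cannot say exactly where in the body the cancer is. Patients will be frightened, and doctors will not know what to do. "Screening healthy people and then telling them, 'Oh, look, you have a tumor in your body, but we don't know where it is': that is not going to work," Haber said.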

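Medicine has a poor track record in cancer screening, and there are precedents. One is the PSA (prostate-specific antigen) test, which measures a protein associated with prostate cancer. The test often gives false positives, and among the tumors it does find, some grow so slowly that they need no treatment at all. Millions of men have ended up being treated for cancers that would never have harmed them.

By one estimate, of every 47 men who have their prostate removed, only one is actually saved from dying of cancer. A study by researchers at Dartmouth College found that mammography also leads to overdiagnosis and overtreatment: about 25% of breast cancers that are diagnosed and treated would never have caused any symptoms. "You screen everybody, and you end up treating people for disease that would never have progressed, or they die of something else," said Jonathan Skinner, a health economist at Dartmouth College. "The downside of early screening can be very large."

But Velculescu of Hopkins believes universal cancer DNA screening will become a reality. "If you think nothing can be done with the result, you are saying you would rather stay ignorant," he said. "I can't imagine that knowing about a cancer would be of no help to the patient. Maybe we won't react strongly to every piece of information, maybe we won't take any action, but because the test is simple it is easy to repeat regularly, and then we can tell the patient, 'Let's see how it develops.'"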

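So far, no company is planning large-scale cancer screening of apparently healthy people. At present only Personal Genome Diagnostics, a diagnostics company newly founded by Diaz and Velculescu, and a few others such as Boreal Genomics and Guardant Health offer liquid biopsies, and only to patients with advanced cancer. For these patients, a blood test can show whether a treatment is working and, if not, that another therapy should be tried. Another valuable use of the technology is tracking the specific DNA mutations that drive a tumor. Because many new cancer drugs are "targeted", blocking a specific molecular process, they are given only when the drug is expected to work on the patient's tumor. Doctors can already test tumor DNA through tissue biopsies, but a noninvasive blood test is simpler and safer and allows patients to be checked more often. Since cancer DNA keeps mutating, blood tests can help patients switch drugs at the right time.

To Helmy Eltoukhy, the president of Guardant, the liquid biopsy has wide uses and is "a fantastic idea". For commercial and medical reasons, his company currently serves only patients who already have cancer, but he said early screening is part of the company's plans. "Obviously, it is the Holy Grail," he said. "Given everything it could be used for, that is the direction we have been working toward."

I asked Vogelstein, 65, and Velculescu, 44, whether they had had their own blood tested. Both said no. Overall, though, about 40% of Americans can expect to develop cancer, and the odds rise with age. If the researchers themselves have not taken the test, the public seems unlikely to rush to be tested either. And if cancer screening were rolled out widely as a public health measure, the entire medical community would have to be involved, which would certainly take a great deal of time.

Vogelstein does not think any of this will come easily. Whatever happens, we will still need new drugs for people who already have cancer. But he remains convinced that the best way to beat late-stage cancer is to stop it before it gets that far. When I offered my condolences on his brother's death, Vogelstein waved both hands. "That is why I do this research," he said. "A hundred years from now, when very few people die of cancer, it will overwhelmingly be because cancers were found early, not because we learned to cure bodies riddled with tumors."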

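JAMA: clinical exome sequencing yields a diagnosis in 25% of cases

Researchers from the Department of Molecular and Human Genetics, the Department of Pediatrics, and the Human Genome Sequencing Center at Baylor College of Medicine, together with the University of Texas Health Science Center, report that among 3,386 patients who underwent whole-exome testing, about 25% were diagnosed with a known genetic disease (a gene mutation or disease-associated variant). The result confirms the diagnostic rate in their initial report of 250 cases, published a year earlier in the New England Journal of Medicine.
Christine Eng will also present the current results at the American Society of Human Genetics annual meeting in San Diego on October 21, 2014.
Dr. James R. Lupski, a co-author of the paper and professor of molecular and human genetics and of pediatrics at Baylor College of Medicine, said: "The findings in this report will forever change the future practice of pediatrics and of medicine as a whole. It is only a matter of time before genomics moves up the physician's list of things to do. It will be the new 'family history', giving you both the important variants inherited from each parent and the new mutations that contribute to disease susceptibility."

In fact, most of the diagnoses were made in patients who carried a new mutation (arising in the egg or sperm) that had not been seen in either parent.

Next-generation sequencing is used to read the patient's DNA, and the result is compared with a normal reference. Any disease-associated mutation found is then compared with the parents' DNA to determine whether the child inherited it, giving a better understanding of the cause of disease. In this study, whole-exome sequencing also identified ways in which physicians could intervene clinically to relieve or eliminate symptoms, and gave families more information about the likely course of the disease.

Besides demonstrating the 25% diagnostic rate in a larger group of patients, this latest study also shows that rare genetic events contribute to disease susceptibility on a large scale.

The major causes of disease included de novo events in the patient, meaning a single change arising for the first time in a gene (a so-called Mendelian mutation); uniparental disomy (in which the patient inherits both copies of a mutation from the same parent); mosaicism; and copy-number changes.

Lupski, who is also a clinical pediatric geneticist at Texas Children's Hospital, said: "Clinical exome sequencing can help diagnose a wide range of hard-to-diagnose diseases." Many of the patients in this study came from Texas Children's Hospital or other medical centers across the United States.

Lupski said: "Rare variants and Mendelian diseases are an important contributor to disease in the population. This contrasts sharply with the thinking of population geneticists, who use genome-wide association studies to investigate how common variants cause disease susceptibility. We found that, in aggregate, 'rare variants' actually contribute substantially to disease susceptibility. Individual diseases may be rare, but there are thousands of them, and more are being defined through genomics."

Dr. Sharon Plon, professor of pediatrics and of molecular and human genetics at Baylor College of Medicine, director of the Baylor Cancer Genetics Clinic, and a member of the Texas Children's Cancer Center, said: "I expect that over the next few years we will learn how important whole-exome sequencing is in adult medicine and in fields beyond pediatrics. An NIH-supported clinical trial is now under way in which whole-exome sequencing is performed on childhood cancer patients to assess its potential utility for them."

In the detailed study of 2,000 patients, 504 received a molecular diagnosis. Of these, 280 had a disease-causing mutation in a single gene copy (autosomal dominant), 181 were autosomal recessive (both copies mutated), 65 were X-linked (a mutation on the X chromosome), and one was presumed to be mitochondrial. In 5 cases the patient inherited both copies of the mutated gene from the same parent (uniparental disomy). Among the dominant mutations, 208 were de novo rather than inherited, 32 were inherited, and 40 were undetermined because parental samples were not available for laboratory analysis.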
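
The counts in this paragraph can be cross-checked with a few lines of Python. All numbers are taken from the text; the variable names are illustrative. The observation that the per-mode counts exceed the number of diagnosed patients is a property of the figures as quoted; one possible reading, which the text does not state, is that some patients received two diagnoses, as in the blended phenotypes Eng describes.

```python
cohort = 2000
diagnosed = 504

# Diagnoses by inheritance mode, as quoted in the text
by_mode = {
    "autosomal dominant": 280,
    "autosomal recessive": 181,
    "X-linked": 65,
    "mitochondrial (presumed)": 1,
}

# Origin of the autosomal dominant mutations, as quoted in the text
dominant_origin = {"de novo": 208, "inherited": 32, "undetermined": 40}

diagnostic_rate = diagnosed / cohort
de_novo_share = dominant_origin["de novo"] / by_mode["autosomal dominant"]
mode_total = sum(by_mode.values())
surplus = mode_total - diagnosed  # diagnoses counted beyond one per patient

print(f"diagnostic rate: {diagnostic_rate:.1%}")
print(f"dominant origins sum to {sum(dominant_origin.values())}")
print(f"de novo share of dominant cases: {de_novo_share:.1%}")
print(f"per-mode counts sum to {mode_total}, {surplus} more than {diagnosed}")
```

The diagnostic rate matches the headline figure of about 25%, and the three origin categories add up exactly to the 280 dominant cases.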

 

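Among the de novo mutations there were 5 confirmed cases of mosaicism, indicating that the mutation arose after fertilization. Mosaicism means that the patient has a small population of cells whose genetic makeup differs from that of most cells in the body.

In the 504 cases the researchers found 708 presumptive disease-causing variant alleles, most of them novel and not previously reported. Notably, nearly 30% of the diagnoses involved disease genes that researchers had identified within the past three years. In 65 cases, no genetic test other than exome sequencing was available at the time to find the mutated gene.

Eng said: "Physicians usually try to find a single diagnosis that explains all of a patient's problems. We found that in some cases a patient may have a blended phenotype of two different diseases. That such patients had two different rare genetic disorders accounting for their condition was an unexpected finding, one not anticipated before whole-exome sequencing was used."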

 

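Roundup: the hottest targets in cancer drug development in 2014

Since 2000, tumor signaling networks have been progressively elucidated and refined, and a large number of molecularly targeted drugs have entered clinical trials and reached the market. In recent years several drugs have been launched against tyrosine kinase targets such as Bcr-Abl, VEGF/VEGFRs, PDGF/PDGFRs, EGFR/HER2, and ALK. Development of me-too products is gradually slowing, but work on expanding indications, overcoming resistance, and optimizing treatment regimens is far from over.

Within the tumor signaling network, targets such as FGFR, c-Met, HER3, and Hedgehog are currently attracting considerable research, but the hottest are the two intracellular signaling pathways PI3K/Akt/mTOR and Raf/MEK/ERK. In 2013 the FDA approved the BTK inhibitor ibrutinib, which is highly effective in CLL, prompting a number of companies to develop me-too and me-better drugs.

Many new drugs are also in development against cell-cycle regulators such as Aurora kinases, CDK, and ChK. The standouts are clearly the CDK4/6 inhibitors, three of which have advanced to late-stage development, whereas most Aurora kinase and ChK inhibitors have failed in early clinical trials. Drug development against PARP, which acts in DNA damage repair, is warming up again, and several molecules against new protein-protein interaction targets such as Bcl-2, MDM2, and IAP have entered clinical trials.

Epigenetic modulators deserve special mention. Azacitidine and decitabine, discovered years ago, were later shown to be DNA methyltransferase inhibitors. The most heavily studied class at present is HDAC inhibitors, and a great deal of basic research is also under way on other epigenetic targets such as the histone lysine methyltransferase EZH2, the histone H3 methyltransferase DOT1L, and the bromodomain BET proteins.

The brightest area in oncology lately is undoubtedly immunotherapy. Modulating immune checkpoints such as CTLA4, PD1/PDL1, 4-1BB, OX40, and CD27 can activate T-cell immune responses, and the use of genetically engineered CAR and TCR T cells marks the arrival of the era of personalized immunotherapy.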

 

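1. Bcr-Abl inhibitors

Bcr-Abl inhibitors are used mainly to treat chronic myeloid leukemia (CML). The FDA has approved several, including imatinib, nilotinib, dasatinib, and ponatinib; the third-generation inhibitor ponatinib can overcome the T315I resistance mutation. In China, the domestically developed flumatinib and 美迪替尼 have entered clinical trials, and HQP1351, a ponatinib analog from 广药集团 (Guangzhou Pharmaceutical Holdings), is about to be filed for clinical trials. With so many drugs already on the market, companies have largely stopped developing new Bcr-Abl inhibitors.

2. VEGF/VEGFRs inhibitors

VEGF/VEGFRs is the classic angiogenesis signaling pathway, and drugs against it are used to treat many solid tumors as well as wet age-related macular degeneration (AMD). FDA-approved antibodies and fusion proteins against VEGF/VEGFRs include bevacizumab, ranibizumab, aflibercept, and ramucirumab. The domestically developed conbercept (trade name: 朗沐) reached the Chinese market in 2013.

Small molecules against VEGFR often inhibit other tyrosine kinases as well. Several drugs of this type, including sorafenib and sunitinib, are already on the market, and many analogs have been filed in China. Notably, in 2014 the FDA approved ramucirumab for gastric cancer, and apatinib, developed in-house by Jiangsu Hengrui, is also about to launch.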

 

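3. PDGF/PDGFRs inhibitors

PDGFRs are highly similar to VEGFRs, and many small-molecule drugs, such as sorafenib, sunitinib, and pazopanib, inhibit both. In January 2014 Bayer paid 25.5 million US dollars to partner with Regeneron on an anti-PDGFRβ antibody to be combined with aflibercept for wet AMD. In May 2014 Novartis paid 1.03 billion US dollars to Ophthotech Corporation for the phase III anti-PDGF drug Fovista, also for wet AMD.

4. FGF/FGFRs inhibitors

Like VEGFRs and PDGFRs, FGFRs are involved in tumor proliferation and blood vessel formation, but no FGFR inhibitor has yet reached the market. Boehringer Ingelheim developed the VEGFR/PDGFR/FGFR inhibitor nintedanib for non-small-cell lung cancer and idiopathic pulmonary fibrosis; it received FDA breakthrough therapy designation in January 2014.

China developed the FGFRs/VEGFRs inhibitor 德立替尼 (lucitanib, E-3810, AL3810). After changing hands several times, the US and Japanese rights now belong to Clovis Oncology, and the rights outside the US, Japan, and China were acquired by Servier. The drug has been filed for clinical trials in China and is supported by the national Major New Drug Development program.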

 

 

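5. EGFR/HER2/HER3 inhibitors

EGFR, HER2, and HER3 are all tyrosine kinases of the ErbB family. Marketed drugs include anti-EGFR antibodies, anti-HER2 antibodies and ADCs, EGFR inhibitors, and EGFR/HER2 inhibitors, used to treat solid tumors such as non-small-cell lung cancer, HER2-positive breast cancer, colorectal cancer, and head and neck cancer.

Third-generation EGFR inhibitors can overcome the T790M resistance mutation. AZD9291 and CO-1686 have drawn worldwide attention, and both have received FDA breakthrough therapy designation. The domestically developed 艾维替尼 and 迈华替尼 can also overcome T790M and have been filed for clinical trials.

6. HGF/c-Met inhibitors

c-Met, also known as HGFR, is a popular target for anticancer drug development, like other growth factor receptors. The marketed c-Met inhibitors are crizotinib and cabozantinib, but both inhibit other targets as well as c-Met. The failure of onartuzumab and tivantinib in phase III trials in non-small-cell lung cancer was a major blow to the development of selective c-Met inhibitors; better patient selection methods or different indications may be needed.

AstraZeneca licensed 沃利替尼 (volitinib) from Hutchison MediPharma in China. Data reported at ASCO 2014 showed partial responses in 3 of 6 patients with papillary renal cell carcinoma who took the drug, and AstraZeneca is now focusing on that indication. Several c-Met inhibitors have been filed for clinical trials in China, including Hutchison's volitinib, BPI-9016M from Betta Pharmaceuticals, and 伯瑞替尼 from 北京浦润奥.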

 

 

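7. ALK inhibitors

ALK becomes oncogenic when activated by gene fusion: 70-80% of anaplastic large-cell lymphomas carry an NPM-ALK fusion, and 6.7% of non-small-cell lung cancers carry an EML4-ALK fusion. The first ALK inhibitor approved by the FDA was crizotinib, for ALK-positive non-small-cell lung cancer, but crizotinib also inhibits c-Met and RON.

Second-generation ALK inhibitors no longer inhibit c-Met and can overcome crizotinib resistance. Ceritinib and alectinib have both received FDA breakthrough therapy designation. Domestically developed ALK inhibitors include 氟卓替尼 from Jiangsu Hansoh and CT-707 from 北京赛林泰.

8. Aurora kinase inhibitors

Aurora kinases are serine/threonine kinases that regulate mitosis. Mammals have three subtypes: Aurora A, Aurora B, and Aurora C. Companies have developed pan-Aurora inhibitors as well as selective Aurora A and Aurora B inhibitors, but almost all were declared failures in early clinical trials.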

 

 

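9. CDK inhibitors

CDK stands for cyclin-dependent kinase. There are multiple subtypes, CDK1 through CDK11 among them, which bind cyclins and regulate the cell cycle. Three CDK4/6 inhibitors, palbociclib, LEE011, and LY2835219, have entered late-stage development for breast cancer, and SHR6390, developed in-house by Jiangsu Hengrui, has been filed for clinical trials.

10. ChK inhibitors

ChK is short for checkpoint kinase. Its two subtypes, ChK1 and ChK2, are key regulators of the cell cycle. Several companies have developed ChK1 inhibitors for cancer, but most failed in early clinical studies. Genentech's GDC-0575 is currently in phase I trials.

11. PARP inhibitors

PARP stands for poly(ADP-ribose) polymerase. It recognizes DNA single-strand breaks and initiates their repair. PARP inhibitors were first developed to enhance the efficacy of chemotherapy and were later aimed mainly at DNA-repair-deficient cancers. In 2011 and 2012 the clinical programs for olaparib and iniparib suffered setbacks and PARP inhibitor development cooled, but the field revived as olaparib and veliparib entered phase III trials and iniparib was shown not to be a true PARP inhibitor. In November 2013 a German company acquired BeiGene-290, a PARP inhibitor developed by BeiGene, for 170 million euros; the drug has now entered phase I trials.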

 

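12. Bcl-2 inhibitors

The Bcl-2 protein family is an important class of apoptosis regulators. It includes anti-apoptotic proteins (such as Bcl-2, Bcl-xL, and Mcl-1) and pro-apoptotic proteins (such as BID, BIM, BAD, BAK, BAX, and NOXA). Bcl-2 and Bcl-xL are overexpressed in many tumors, making cancer cells resistant to treatment.

Teva once advanced the Bcl-2 inhibitor obatoclax into phase III trials but ultimately abandoned its development. Obatoclax has a Ki of only 0.22 μM, whereas the Ki of ABT-199 is below 0.01 nM. In China, Jiangsu Ascentage has two Bcl-2 inhibitors in development: R-(-)-gossypol acetate is in phase II trials, and APG-1252 is in preclinical development.

13. Hedgehog inhibitors

Hedgehog is an important cancer signaling pathway, initiated by Hedgehog ligands and the Ptch/Smo receptor complex. Ptch and Smo are encoded by the tumor suppressor gene Patched and the oncogene Smoothened, respectively, and Ptch negatively regulates Smo. The drugs developed so far are mainly Smo inhibitors.

Genentech has launched vismodegib for basal cell carcinoma. Novartis's drug in the same class, sonidegib (erismodegib, LDE225), succeeded in a phase II trial in basal cell carcinoma, and a marketing application was submitted in Europe in the second quarter of 2014.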

 

 

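14. p53/MDM2 inhibitors

p53 is a well-known tumor suppressor gene. p53 promotes the expression of MDM2 and MDM4, and MDM2 in turn drives the ubiquitin-mediated degradation of p53, so that p53 and MDM2/MDM4 settle into an equilibrium. In 2010 Roche ran a proof-of-concept study of RG7112, which induced upregulation of p53 and MDM2 expression and showed some clinical benefit in cancer patients.

15. PI3K/Akt/mTOR inhibitors

PI3K is phosphatidylinositol 3-kinase. Its main function is to catalyze the conversion of PIP2 to PIP3, which activates downstream Akt/mTOR signaling. PTEN does the opposite, catalyzing the conversion of PIP3 back to PIP2. PI3K has eight subtypes in three classes (I, II, and III). The most important in tumors are the four class I subtypes, PI3Kα, PI3Kβ, PI3Kγ, and PI3Kδ, each a heterodimer of a catalytic subunit (p110α, p110β, p110γ, or p110δ) and a regulatory subunit (p85).

Drugs against the PI3K/Akt/mTOR pathway include pan-PI3K inhibitors, selective PI3K inhibitors, rapamycin analogs, mTOR active-site inhibitors, dual PI3K/mTOR inhibitors, and Akt inhibitors. Those already marketed are the rapamycin analogs temsirolimus and everolimus and the selective PI3Kδ inhibitor idelalisib. Domestically developed PI3K inhibitors include 乌咪德吉 from Jiangsu Hengrui (a dual PI3K/mTOR inhibitor) and BEBT-908 from 广州必贝特 (a dual PI3K/HDAC inhibitor).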

 

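16. Raf/MEK/ERK inhibitors

Ras/Raf/MEK/ERK is a signaling pathway that links cell-membrane receptors to the nucleus. Raf has three members, A-Raf, B-Raf, and C-Raf, and MEK has two, MEK1 and MEK2. The drugs developed include B-Raf inhibitors and MEK inhibitors. Selective B-Raf inhibitors and MEK inhibitors are used mainly for melanoma, and the two classes can be combined. Dabrafenib is also being developed for B-Raf V600E-mutant non-small-cell lung cancer and has received FDA breakthrough therapy designation for it.

BeiGene developed the second-generation B-Raf inhibitor BGB-283 in-house, a project also supported by the Major New Drug program of the 12th Five-Year Plan. It was licensed to Germany's Merck KGaA in May 2013, and patient enrollment began in December 2013, after which BeiGene received a 5 million US dollar milestone payment.

17. HDAC inhibitors

HDAC stands for histone deacetylase. There are multiple subtypes, HDAC1 through HDAC11 among them, which remove acetyl groups from histone lysines so that histones bind DNA tightly and block its transcription. The FDA has approved two HDAC inhibitors, vorinostat and romidepsin, for cutaneous T-cell lymphoma, and Novartis has submitted a marketing application for panobinostat in multiple myeloma.

Shenzhen Chipscreen developed the HDAC inhibitor chidamide in-house. It has been filed for manufacturing approval for non-Hodgkin lymphoma, and it is also in phase I trials for breast cancer and phase II trials for non-small-cell lung cancer.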

 

 

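18. Immune checkpoint modulators

T-cell activation requires two signals. The first is the antigen information presented by MHC and received by TCR/CD3. The second comes from a series of receptors and ligands on the cell surface, some inhibitory and some stimulatory, collectively called immune checkpoints. Modulating immune checkpoints can activate or suppress T cells and so treat tumors or autoimmune diseases.

More than ten ligands or receptors that mediate the second signal have been identified, and new signaling pathways are still being discovered and refined. The two classic inhibitory pathways are PD1 and CTLA4, and in 2014 three co-stimulatory signals, OX40, CD27, and CD137 (4-1BB), are gradually moving into clinical development.

Because anti-CTLA4 and anti-PD1/PDL1 antibodies have performed so well in the clinic, they are regarded as the revolution in cancer treatment that follows targeted therapy. Pembrolizumab, nivolumab, and MPDL3280A have all received FDA breakthrough therapy designation. Combining immune checkpoint modulators with one another or with other anticancer drugs is another current hot topic.

Several Chinese manufacturers have anti-PD1/PDL1 drugs in preclinical development, but none has yet been filed for clinical trials. Merck and Bristol-Myers Squibb submitted clinical trial applications to the CFDA in May 2013. In 2005, 中信国健 (CITIC Guojian) filed a CTLA4-antibody fusion protein for the treatment of autoimmune diseases.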