#! /usr/bin/env perl

use strict;
use warnings;
use Pod::Usage;
use Smash::Global qw($SMASH_SCRIPT_NAME);
use Smash::CommandLineParser qw(parse_options check_required_options print_options);
use Smash::Analyses::Assembler;

##############
# Set up command line parsing
##############

my @allowed  = qw(metagenome=s assembler=s genome_size=i assembly=s version=s pkg_dir=s cluster=s finish single_genome superiterative mix_assemblers extra_options=s help);
my @required = qw(assembler);   # arguments I require

##############
# Parse command line options
##############

my $status;
my $missing;
my %options;

($status, %options) = parse_options(\@allowed);
if ($options{help}) {
	pod2usage(-exitstatus => 0, -verbose => 2);
}
if ($status != 1) {
	pod2usage(-message => "", -exitstatus => 2, -verbose => 1);
}
#print_options(%options);
($status, $missing) = check_required_options(\@required, %options);
if ($status != 1) {
	pod2usage(-message => "$SMASH_SCRIPT_NAME: Missing argument --$missing\n", -exitstatus => 2, -verbose => 1);
}

##############
# Handle command line options
# (--assembler has already been checked above by check_required_options;
# the remaining requirements depend on the mode and are checked here)
##############

if ($options{finish}) {
	@required = qw(assembly);   # arguments I require
	($status, $missing) = check_required_options(\@required, %options);
	if ($status != 1) {
		pod2usage(-message => "$SMASH_SCRIPT_NAME: --finish requires --assembly to be specified\n", -exitstatus => 2, -verbose => 1);
	}
} else {
	@required = qw(metagenome genome_size);   # arguments I require
	($status, $missing) = check_required_options(\@required, %options);
	if ($status != 1) {
		pod2usage(-message => "$SMASH_SCRIPT_NAME: Missing argument --$missing\n", -exitstatus => 2, -verbose => 1);
	}
}

my $assembler = $options{assembler};

if ($assembler eq "Forge" && !$options{cluster}) {
	pod2usage(-message => "$SMASH_SCRIPT_NAME: --assembler=Forge requires --cluster to be specified\n", -exitstatus => 2, -verbose => 1);
}

if ($options{superiterative} && $assembler ne "Arachne") {
	pod2usage(-message => "$SMASH_SCRIPT_NAME: --superiterative only works with Arachne\n", -exitstatus => 2, -verbose => 1);
}

my $instance = "Smash::Analyses::Assembler::$assembler"->new (map {uc($_) => $options{$_}} keys %options);
$instance->init();
if ($options{superiterative}) {
	my $MAX_ITER_PER_RUN = 2;
	my @modes = qw(Assemble);
	if ($options{mix_assemblers}) {
		@modes = qw(Assemblez Assemble);
	}
	$instance->superiterate($MAX_ITER_PER_RUN, @modes);
} else {
	$instance->run();
}
$instance->finish();

exit(0);

=head1 Name

doAssembly.pl - Smash driver script for metagenome sequence assembly

=head1 Synopsis

	doAssembly.pl --metagenome=<name> --assembler=<program> \
	    --genome_size=<size> [options]

=head1 Options

=over 4

=item B<C<--metagenome>> (required unless C<--finish> is given)

name of the metagenome

=item B<C<--assembler>> (required)

sequence assembly program (Arachne|Celera|Forge)

=item B<C<--genome_size>> (required unless C<--finish> is given)

estimated genome size for assembly

=item B<C<--cluster>>

name of cluster (required for Forge)

=item B<C<--single_genome>>

specify a single genome assembly (default: false)

=item B<C<--superiterative>>

perform superiterative assembly; currently works only with Arachne (default: false)

=item B<C<--finish>>

finish post-processing of an assembly that was started independently (default: false)

=item B<C<--assembly>>

assembly id to finish (used only with C<--finish>, which requires it)

=item B<C<--pkg_dir>>

location where the assembler <program> is installed

=item B<C<--extra_options>>

extra options to pass to the assembler. The assembler object must know how to handle
this option. See L</Extra Options> for more details.

=item B<C<--help>>

Prints this manual.

=back

=head1 Description

B<doAssembly.pl> is a wrapper script to perform sequence assembly on a given metagenome. This script
looks up the data and data type for the given metagenome, sets things up for the given assembler and runs
the assembly.

Currently it supports three assemblers: Arachne, Celera and Forge. Where this script finishes and what should be done next
depends on which assembler is used.

=over 4

=item B<Arachne>

You can start assembling a metagenome using Arachne as follows:

	doAssembly.pl --metagenome=MC20.MG1 --assembler=Arachne \
	    --genome_size=20000000 --superiterative

The script finishes the assembly and copies all the necessary files from its temporary workspace to the Smash data
repository. (See the User Manual for more information on these locations.) The next step is to run L<loadAssembly.pl|loadAssembly>
like so:

	loadAssembly.pl --assembly=MC20.MG1.AS1

Please check L<Smash::Analyses::Assembler::Arachne> for information on Smash's Arachne assembly process.

=item B<Celera>

The Celera assembler is more of an assembly pipeline with several steps, so it is normally run on a Sun Grid Engine cluster. To do
so, you must have configured a cluster with L<Smash::Utils::Cluster>. In this case, you can run the assembly like so:

	doAssembly.pl --metagenome=MC20.MG2 --assembler=Celera \
	    --genome_size=20000000 --cluster=sigma

Make sure to capture the output somewhere, since it contains useful information. The output looks something like this:

	Retrieving Sanger read information ... done
	Retrieving 454 read information ... done
	Job submitted as 264970
	Job submitted as 264971
	Step 1 of Celera assembly of MC20.MG2.AS3 submitted as job 264970 in cluster sigma.
	Step 2 of Celera assembly of MC20.MG2.AS3 submitted as 264971 with a 'hold'.
	1. Wait till all the Step 1 jobs finish. 
	   Usually these contain MC20.MG2.AS3 in the job-name.
	2. Check workspace/Assembler/Celera/MC20.MG2.AS3/MC20.MG2.AS3/9-terminator/MC20.MG2.AS3.qc.
	3. If it looks good, then release the hold on Step 2 using
		qrls 264971
	4. Wait till all the Step 2 jobs finish. 
	   These also contain MC20.MG2.AS3 in the job-name.
	5. Check workspace/Assembler/Celera/MC20.MG2.AS3/MC20.MG2.AS3/10-toggledAsm/9-terminator/MC20.MG2.AS3.qc.
	6. If it looks good, then run doAssembly.pl as follows to finish up
		doAssembly.pl --assembler=Celera --version=6.1 --finish --assembly=MC20.MG2.AS3

Smash uses a two-step procedure for Celera assembly of metagenomes. The first step is the normal assembly,
and the second step is called "repeat toggling". The Celera assembler can run both steps automatically. However,
Smash uses slightly different parameters in each step, which the Celera assembler does not support.
Therefore these two steps have to be managed externally by Smash. In this example, the Celera assembler pipeline will
run as job 264970, but it will submit more jobs for all its steps. All these job names will contain "MC20.MG2.AS3".
After all these steps are done, the only remaining job with "MC20.MG2.AS3" in its name is job 264971. At this stage, check
the first C<qc> file in bullet (2) to see if everything looks good. The first few lines should look like:

	[Scaffolds]
	TotalScaffolds=1079
	TotalContigsInScaffolds=1079
	MeanContigsPerScaffold=1.00
	MinContigsPerScaffold=1
	MaxContigsPerScaffold=1

If it looks good, release the hold on the second job by typing:

	qrls 264971

This will start step 2 of the assembly, and will submit more jobs to the cluster. When no jobs with
"MC20.MG2.AS3" in their names remain in the cluster, check the C<qc> file in bullet (5).  If it looks good, then you can finish this
assembly by running

	doAssembly.pl --assembler=Celera --version=6.1 --finish --assembly=MC20.MG2.AS3

You can now load the assembly to Smash using:

	loadAssembly.pl --assembly=MC20.MG2.AS3

Please check L<Smash::Analyses::Assembler::Celera> for information on Smash's Celera assembly process.
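
A small shell helper can make the C<qc> checks in bullets (2) and (5) easier to repeat. The following is a
minimal sketch and not part of Smash: the function name C<qc_value> is hypothetical, and it assumes that the
value of interest sits on a C<KEY=value> line like those shown in the excerpt above. If a key occurs more than once
in the file, only its first occurrence is printed. Deciding whether a value "looks good" is still left to the user.

	# Hypothetical helper (not shipped with Smash):
	# print the value of the first "KEY=value" line in a qc file.
	# Usage: qc_value FILE KEY
	qc_value() {
		awk -F= -v key="$2" '$1 == key { print $2; exit }' "$1"
	}

	# Example call (the path is the one from bullet (2) above):
	# qc_value workspace/Assembler/Celera/MC20.MG2.AS3/MC20.MG2.AS3/9-terminator/MC20.MG2.AS3.qc TotalScaffolds

The helper prints nothing when the key is absent, so an empty result is worth treating as a reason to inspect the
file by hand before releasing the hold on step 2.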

=item B<Forge>

Since Forge uses MPI, this script does not wait until the MPI processes finish. Instead, it starts the MPI processes and quits.
The user must check whether the processes are done and then run a finishing step before loading the assembly into the database,
like so:

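	doAssembly.pl --metagenome=MC20.MG2 --assembler=Forge \
	    --genome_size=20000000 --cluster=sigma
	doAssembly.pl --metagenome=MC20.MG2 --assembler=Forge \
	    --genome_size=20000000 --cluster=sigma \
	    --finish --assembly=MC20.MG2.AS1
	loadAssembly.pl --assembly=MC20.MG2.AS1

The finishing step B<must> be run with the B<exact same options> as the first step, plus B<C<--finish>> and the
B<C<--assembly>> id to finish, for this to work correctly. This script requires C<--cluster> whenever
C<--assembler=Forge> is used; C<sigma> here is only the example cluster name from the Celera section.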

=back

=head2 Extra Options

Smash has built-in default options for each of the assemblers. You can override these options using the
C<--extra_options> parameter. Currently Arachne and Celera understand this option. Please check
L<Smash::Analyses::Assembler::Arachne> for default parameters for Arachne and
L<Smash::Analyses::Assembler::Celera> for default parameters for Celera.

=cut
