- BCBS of NC II
Healthcare firms are focused more than ever on agility. But tactically cutting costs while launching new market strategies requires a data-driven approach to planning and managing numerous IT projects. To help, Blue Cross Blue Shield (BCBS) of NC has created Aardvark, a mobile/web app that provides a common interface for collecting, updating, and managing information about various, ongoing IT projects.
Although Aardvark offers a straightforward interface for collecting data, the problem of navigating and visualizing this data appropriately remains. Working with developers from BCBS of NC, the NCSU Senior Design team will help to solve this problem. Using the same agile/scrum methods practiced at BCBS, the NCSU team will create an interactive dashboard interface with the following features:
- Alternative data visualization models suited to the data collected by Aardvark.
- Impact analysis views to permit the user to explore relevant what-if questions associated with ongoing projects.
- Support for subject-area extracts and analysis of data, both scheduled and on-demand.
- Authentication and access control for different classes of users.
BCBS will work with the NCSU senior design team to create sample data for use in designing, implementing and testing this project. We will work with the team to select suitable technologies (e.g., D3) that enable flexibility and creativity in what the dashboard can display and how it displays it.
Stream Management in Backups
Data Domain is a line of backup products from EMC that provide fast, reliable and space-efficient online backup of files and databases comprising terabytes of data via a variety of network protocols (CIFS, NFS, OST). Using advanced compression and data de-duplication technology, gigabytes of data can be backed up to a Data Domain server in minutes and reduced in size by a factor of ten to thirty or more.
Customers must be able to back up their data efficiently to meet constantly decreasing backup time periods. Multiple customer systems may be backing up data to a single Data Domain system simultaneously. The different backups may require different resources from the Data Domain system. One critical resource is streams, representing the number of files being actively backed up by the backup application. Streams are a limited global resource provided by the Data Domain system and shared by all backup applications using the Data Domain system. Previously there was no means for different backup applications to coordinate their usage of streams, or any way for users to allocate streams among different applications. Streams were allocated to backup applications on a first come, first served basis. A previous project implemented a prototype version of the library used by the backup applications that could monitor stream usage among multiple applications so that stream usage information could be known and shared among backup applications. This prototype provided only very simple capabilities for allocating stream resources to an application.
The focus of this project is to enhance and extend this initial prototype library to perform more complete stream allocation and management. The team will devise and implement management policies and algorithms that take advantage of the streams tracking mechanisms provided to enable stream resources to be shared efficiently among competing backup applications. Factors that may need to be considered include the number and types of streams available, the current and projected streams required by each application, the priority or other service requirements of the applications, the time(s) by which the backups must complete, and the current load on the client and server systems. The purpose of the enhanced prototype is to demonstrate how stream resource usage might be best monitored and managed to improve overall backup performance without requiring any changes in the backup application software or to the Data Domain system.
This project will extend the previous project's C library sources and use the existing Data Domain loaner system (currently in place at NCSU). The existing shim layer adds the ability to monitor the stream resources available from the Data Domain system, keeps track of the streams used, and shares the information on streams used and needed with other backup applications. The goal is to use these basic mechanisms to apportion the streams among the applications so that all the backups receive the number of streams they need to complete efficiently, based on user-defined parameters such as backup priority, deadline, required minimum and maximum number of streams, and possibly others defined by the project team.
When the backup application opens a file for writing a backup or for reading to restore a file, the shim layer will decide whether to allow use of another stream, based on the application's current stream usage, the stream usage of other applications, the streams available on the Data Domain system being used for the backups and various defined configuration parameters. When a file is closed the shim layer may make the stream available to the backup application closing the file, or to another application depending on the current stream allocations, usages, and configured parameters. A backup/restore application that calls the new shim layer will be provided. Source code from the previous project will be provided that will be used as a starting point for the extended shim library. These sources can be modified appropriately to incorporate the new stream management features.
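The open-time decision described above can be sketched as follows. This is only an illustration in Python (the actual shim library is written in C), and the `AppState` record, the `may_open_stream` function, and the reservation rule are all hypothetical, not the policy the team is expected to adopt.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical per-application record kept by the shim layer."""
    name: str
    priority: int      # larger value = more important (assumed convention)
    min_streams: int   # streams the application should always be able to get
    max_streams: int   # cap on streams the application may hold
    in_use: int = 0    # streams currently held


def may_open_stream(app, apps, total_streams):
    """Decide whether `app` may take one more stream when it opens a file."""
    if app.in_use >= app.max_streams:
        return False
    used = sum(a.in_use for a in apps)
    if used >= total_streams:
        return False
    # Below its configured minimum, the application is always admitted.
    if app.in_use < app.min_streams:
        return True
    # Otherwise keep enough free streams in reserve to cover the unmet
    # minimums of other applications with equal or higher priority.
    reserved = sum(
        max(0, a.min_streams - a.in_use)
        for a in apps
        if a is not app and a.priority >= app.priority
    )
    return total_streams - used > reserved
```

With five streams in total, a low-priority application already holding three is refused a fourth while a higher-priority application still has an unmet minimum of two. Deadlines and system load, which the text also lists as factors, are left out of this sketch.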
The project being proposed consists of two phases: (1) extending the existing library that monitors stream usage to manage stream allocation according to defined policies, and (2) evaluating the performance of the new library to measure the effectiveness of the implemented policies. The team will participate in defining the policies to be used in phase 1 and suggesting further improvements or changes based on the performance evaluation of phase 2.
Benefits to NCSU Students
This project provides an opportunity to attack a real life problem covering the full engineering spectrum from requirements gathering through research, design and implementation and finally usage and analysis. This project will provide opportunities for creativity and innovation. EMC will work with the team closely to provide guidance and give customer feedback as necessary to maintain project scope and size. The project will give team members an exposure to commercial software development on state-of-the-art industry backup systems.
Benefits to EMC
The stream management layer will serve as a prototype for a more efficient Data Domain backup library and will allow evaluation of architectural and design decisions in current and future versions of Data Domain backup software. Performance results will aid in estimating the application performance customers can expect when backing up and restoring their data.
EMC Corporation is the world's leading developer and provider of information infrastructure technology and solutions. We help organizations of every size around the world keep their most essential digital information protected, secure, and continuously available.
We help enterprises of all sizes manage their growing volumes of information, from creation to disposal, according to its changing value to the business through big data analysis tools, information lifecycle management (ILM) strategies, and data protection solutions. We combine our best-of-breed platforms, software, and services into high-value, low-risk information infrastructure solutions that help organizations maximize the value of their information assets, improve service levels, lower costs, react quickly to change, achieve compliance with regulations, protect information from loss and unauthorized access, and manage, analyze, and automate more of their overall infrastructure. These solutions integrate networked storage technologies, storage systems, analytics engines, software, and services.
EMC's mission is to help organizations of all sizes get the most value from their information and their relationships with our company.
The Research Triangle Park Software Design Center is an EMC software design center. We develop world-class software that is used in our VNX storage, DataDomain backup, and RSA security products.
- Fidelity Investments I
Heads-up Transaction Monitor
The student team will create a real-time monitor that displays a visual representation of transactions traveling through a system. The project team could start with two transactions and add a third if time permits to show the extensibility of the solution. Transactions can include reads and updates that are initiated by the system and tracked by the monitor. The solution should be designed in a generic way such that transactions can be 'framed' and defined in terms of arbitrary steps/phases/states. The transaction is green if all is well but turns yellow or red if certain classes of problems occur. Such a monitor could also have higher-level views of overall health - e.g., for the system as a whole, by subsystem, or health by subcomponent - with a consistency of green/yellow/red indications for 'at a glance' status info. The audience for this project would be 1) operations personnel monitoring system health and 2) application developers who support the system (understand the logic of the code and transactions being monitored).
More detailed information
"Heads-up" refers to an application that can be continuously running in an operations control room such that problem situations can be easily spotted by operations personnel. The application may also be made available to individuals via a web browser for on-demand status/information gathering.
The monitor itself is 'real-time' in the sense that at any given time it displays an indication of the current status of a particular system using a simple color coding. System here refers to an application that supports transactions such as read and update. For the purposes of this project the team isn't required to build this system; rather, they can create the output that this system would generate if it were being used. That output can then be polled in a continual manner to populate the information rendered in the Monitor display. The default state of the display may show a single shape representing the system as a whole. The user may click on an entity in the display to 'drill down' to a lower level of detail; in this way, clicking on the system as a whole might display multiple shapes indicating the system's subcomponents, and clicking on a subcomponent might show individual transactions being handled by each subcomponent.
Students would not be expected to modify an existing system; rather, students would design this application to be a monitoring service with which any application/system can be integrated. "System" refers to any software system (or subcomponent thereof); the simplest form of such a system might be a small application, while more complex forms might be a suite of applications such as NetBenefits. The monitoring application can be made aware of a given software system or application via the log records written by the 'client' system/application - i.e., the monitoring application continuously monitors the messages that are logged by the client system/application, making logical sense of those messages (log records) by relating them to each other. A critical component of this project is the specification of a logging standard that specifies how entities being monitored (i.e., represented in the monitoring display) are framed/recognized and how the status of those entities is determined. For example, the project may define a standard 'language' that identifies subcomponents by name, defines the beginning and end of transactions, and signals abnormal termination of a transaction - 'transaction' being any arbitrary operation (as defined by the consumer/client application/system integrators) for which such state monitoring is desired. Examples of potential problems might be: an update transaction that fails because a database is offline, or a read that times out due to an unexpected network slowdown. The monitoring application should be able to track multiple entities and/or transactions (and their states) simultaneously.
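To make the idea of a logging standard concrete, here is a minimal sketch in Python. The record format (`<seconds> <subcomponent> <transaction id> <BEGIN|END|ABORT>`), the five-second slowness threshold, and both function names are assumptions made for illustration; defining the real standard is part of the project.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def transaction_states(log_lines, now, slow_after=5.0):
    """Map each transaction id to (component, color) from log records."""
    started = {}   # txn id -> (component, start time) for open transactions
    states = {}    # txn id -> (component, color)
    for line in log_lines:
        timestamp, component, txn_id, event = line.split()
        if event == "BEGIN":
            started[txn_id] = (component, float(timestamp))
            states[txn_id] = (component, "green")
        elif event == "END":
            started.pop(txn_id, None)
            states[txn_id] = (component, "green")
        elif event == "ABORT":
            started.pop(txn_id, None)
            states[txn_id] = (component, "red")
    # Assumed rule: a transaction still open after `slow_after` seconds
    # is shown as a warning.
    for txn_id, (component, start) in started.items():
        if now - start > slow_after:
            states[txn_id] = (component, "yellow")
    return states


def component_health(states):
    """Roll transaction colors up to the worst color seen per component."""
    health = {}
    for component, color in states.values():
        current = health.get(component, "green")
        health[component] = max(current, color, key=SEVERITY.get)
    return health
```

The roll-up in `component_health` is one way to drive the higher-level views: a subcomponent takes the worst color among its transactions, and the same rule could be applied again to color the system as a whole.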
- Fidelity Investments II
Voice Navigation of Web Applications
Based on reading DOM elements, dynamically provide a utility that enables voice navigation by the user.
- Provide a utility that can be employed across various websites that allows voice initiated control, action and shortcuts
- Use natural language processing capabilities to translate voice to text and a semantic framework that interprets/translates voice commands to a set of predefined actions.
- Examples: "Go to the Profile Section of this site" (navigates to Profile landing page), "Change my display name to Ben Franklin" (completes a transaction to update display name)
This project will create a mechanism or framework to voice-enable web pages, including pages that weren't initially created with this interaction mode supported. We have examples like Ok Google, Siri and IVR that have become much more widely accepted and in certain use cases preferred over typical touch interaction. There may be richer support for this interaction mode coming through the HTML standard via W3C, and leveraging any existing mechanisms to support this functionality would be welcomed by the project team. Additionally, supporting voice interaction could potentially benefit persons with disabilities and increase the accessibility of web applications.
This solution could also be used on a variety of different devices that support rendering HTML and that possess a microphone for voice input. The project team can select the target device to demonstrate this functionality.
The scope for the project will be to handle navigation of an existing site using voice input. This could involve specifying a literal and/or predefined library of terms that can be matched via voice recognition. For example, the project team could specify verbs for navigation, e.g., "Navigate" or "Back."
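As a rough sketch of literal matching against a predefined set of verbs, the Python function below assumes the speech has already been transcribed to text and that link labels have been read from the page's DOM into a dictionary. The function name, the `page_links` structure, and the returned tuples are hypothetical.

```python
def parse_command(transcript, page_links):
    """Turn recognized speech into an (action, target) pair, or None.

    `page_links` maps lower-case link text found in the DOM to its URL.
    """
    words = transcript.lower().strip(" .!?").split()
    if not words:
        return None
    if words[0] == "back":
        return ("back", None)
    if words[0] == "navigate":
        # Accept "navigate to profile" as well as "navigate profile".
        rest = words[1:]
        if rest and rest[0] == "to":
            rest = rest[1:]
        target = " ".join(rest)
        if target in page_links:
            return ("navigate", page_links[target])
    # Anything else is not part of the predefined command library.
    return None
```

For example, with `page_links = {"profile": "/profile"}`, the transcript "Navigate to Profile." yields `("navigate", "/profile")`, while an unrecognized phrase yields `None`.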
Stretch goals could include:
- Accepting loose matches with Voice input: e.g. "Go to Profile Page," "Navigate to the Profile landing page," and "Open Profile" would all complete the same action of navigating the site to the Profile landing page
- Completing a transaction supported by the site, e.g., a form submission. For example, on amazon.com, saying "Search for the Minions Movie" would enter that text and submit the search in the browser.
- Providing a way for this solution to be utilized by web application developers within the sites they are building to extend the Voice command library to complete specific actions that are custom/unique to their site. E.g. amazon.com could extend this solution by adding a command to "Re-order my laundry detergent" that would complete the '1-click' order transaction via voice input.
- Infusion II
Love Lock Bridge
Infusion is a Consultant Development company based in New York. We have offices all over the world: Toronto, Houston, Raleigh, Malta, London, and we're working on Singapore. We have a myriad of clients in different fields ranging from insurance and banking to children's toys and textbooks.
The Pont des Arts bridge in Paris, France is what people around the world refer to as the "Love Lock Bridge". Couples would purchase a padlock, close it on a link of the guard rail, and then often throw the key into the river below. Recently, Paris authorities had all of the panels of locks removed, effectively ending the practice. Your mission is to recreate this bridge virtually with an Android application. This project will serve as a mobile tech integration POC for our partnership with Samsung.
This project will consist of three parts: a "bridge" UI that might be displayed on a TV in an airport or mall, a "lock" app on an Android device, and a server that facilitates communication between the two clients.
The server should implement the REST pattern and store information like keys and messages in MongoDB.
This is the phone portion of the project. Users should be able to generate a lock, include a message "hidden" in the lock, attach the lock to a "bridge", and then send the key to another user. The lock is stored permanently, and whoever has a key should be able to revisit the lock anytime. Sent keys are copies, but perhaps unique keys can be added as a stretch goal.
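The lock and key behavior described above can be sketched with a small in-memory model. This Python class is only a stand-in for the REST server and its MongoDB collection; the class name, method names, and record fields are hypothetical.

```python
import secrets


class LockStore:
    """In-memory stand-in for the server's lock collection."""

    def __init__(self):
        self._locks = {}  # key token -> lock record

    def create_lock(self, owner, message, bridge_id):
        """Attach a new lock to a bridge and return the key that opens it."""
        key = secrets.token_hex(16)
        self._locks[key] = {"owner": owner, "message": message,
                            "bridge": bridge_id}
        return key

    def send_key(self, key):
        """Sent keys are copies, so sharing hands out the same token."""
        if key not in self._locks:
            raise KeyError("unknown key")
        return key

    def open_lock(self, key):
        """Return the hidden message to whoever presents a valid key."""
        return self._locks[key]["message"]
```

Because a sent key is the same token as the original, both sender and recipient can revisit the lock at any time. The unique-key stretch goal would instead need one token per recipient.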
This will also probably be an Android app that can cast to a TV. This portion of the project will represent the bridge. A user will be able to attach a lock to the bridge if they are in physical proximity to it. They should then be able to send that key to another user who can use it to open the lock and read its message. All bridges are connected to the same database, and can be thought of as a single bridge. Someone in Rome could send a key to someone in Singapore who could then open their lock in Sydney or Beijing. The bridges should display a set of generic locks like a screen saver most of the time but have some animation to bring any locks with keys in proximity to the bridge forward in the view.
Some sort of animation based on proximity. If you have a key to an unopened lock associated with your user, the lock should move around on the screen.
One of the challenges for making this project something people will actually use is user experience. Transferring a key to someone by tapping phones together would be helpful in this endeavor.
- Laboratory for Analytic Sciences (LAS)
Multiple Hypothesis Tracking
This project involves both design and implementation and a research component. Fundamentally, the problem is to create different kinds of synthetic data and then to run that data through the well-known Multiple Hypothesis Tracking (MHT) algorithm. The team will examine the performance of this algorithm, in terms of both its computational speed and its accuracy.
The team will build a simulated Air Traffic Control (ATC) system using the Multiple Hypothesis Tracking formalism of, e.g., Reid (1979). This will include two primary components: 1) the simulated trajectories of the observables, and 2) the MHT algorithm for tracking objects. Part 1 invites a sequence of successively more complicated goals: generating real-valued data in 2, 3, and higher dimensions (for example position, velocity, color, etc.); building a library of different motion models [for example, simple individual motion, group motion, interacting motion (crossing paths)]. For Part 2, the MHT problem has been shown to be NP-hard in its computational complexity. For this problem the goal will be to implement the full MHT solution, explore the performance of the algorithm under successively more complicated data scenarios, and identify the computationally expensive parts of the algorithm.
This is a straight-up implementation of the MHT (which is well referenced) and the creation of different kinds of synthetic data to explore its performance. The team will be able to choose their implementation language (e.g., MATLAB, Python, Perl), but the MHT implementation should be flexible enough to handle the different kinds of synthetic data.
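For Part 1, a simple starting point might look like the Python sketch below, which generates noisy observations for objects moving at constant velocity in any number of dimensions. The function names, the Gaussian noise model, and the per-scan shuffling are assumptions chosen for illustration, and this covers only the simplest motion model, not the MHT algorithm itself.

```python
import random


def simulate_track(start, velocity, steps, noise_std=0.0, rng=None):
    """Return noisy observations of one object moving at constant velocity.

    `start` and `velocity` are equal-length sequences, so the same function
    covers 2, 3, or more dimensions.
    """
    if rng is None:
        rng = random.Random()
    observations = []
    for t in range(steps):
        truth = [p + v * t for p, v in zip(start, velocity)]
        observations.append([x + rng.gauss(0.0, noise_std) for x in truth])
    return observations


def simulate_scene(tracks, steps, noise_std=0.0, seed=0):
    """Simulate several objects and shuffle each scan to hide track identity."""
    rng = random.Random(seed)
    per_track = [simulate_track(s, v, steps, noise_std, rng) for s, v in tracks]
    scans = []
    for t in range(steps):
        scan = [obs[t] for obs in per_track]
        rng.shuffle(scan)  # the tracker must not learn identity from ordering
        scans.append(scan)
    return scans
```

A crossing-paths scenario can be produced by giving two objects trajectories that meet, for example `simulate_scene([([0, 0], [1, 1]), ([0, 4], [1, -1])], steps=5)`, where both objects occupy the point (2, 2) at the third scan.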
End-to-End Storage QoS
Note: This project will require students with the interest/ability to make changes to the Linux kernel (particularly NFS).
Storage consolidation has allowed for increased performance (more spindles available), enhanced protection (snapshots, replication, RAID) and greater storage efficiency (de-duplication, compression, thin provisioning). With consolidation of storage came consolidation of workloads accessing the common storage pool and the problem of sharing a resource among different workloads with different storage requirements, i.e., insufficient bandwidth to satisfy all demands. Enter Storage QoS.
Storage QoS manages workload demands so as to limit the interference of low priority load (e.g. user directories) with high priority load (e.g. payroll processing). The idea is relatively straightforward: queue low priority work, allowing it to trickle in as resources are available. This works well enough to smooth out demand bursts but has some poor behavior if sustained demand is above the system capacity:
- Queuing times increase leading to protocol failures.
- Queued requests tie up resources needed to on-board priority requests.
- Clients with mixed priority workloads can suffer delays due to priority inversion.
Now suppose the client participates in Storage QoS. Instead of the Storage Processor (SP) queuing requests and tying up resources, the client does the throttling. This would reduce the queuing delay experienced by the client protocol engine and make resources available for priority requests on both client and SP.
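One common way to throttle on the client is a token bucket. The Python sketch below shows the idea in user space with explicit timestamps; it is an illustration under assumed names (`TokenBucket`, `try_send`, `set_rate`), not the mechanism the project prescribes, and a real implementation would live in the Linux NFS client.

```python
class TokenBucket:
    """Client-side throttle that admits at most `rate` requests per second."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate        # tokens added per second (the IOPS limit)
        self.burst = burst      # bucket capacity, the largest burst allowed
        self.tokens = burst
        self.last = now

    def _refill(self, now):
        """Credit tokens for the time elapsed since the last update."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now

    def set_rate(self, rate, now):
        """Apply a new limit, for example one sent by the storage server."""
        self._refill(now)  # settle time already elapsed at the old rate
        self.rate = rate

    def try_send(self, now):
        """Return True if a request may be issued at time `now` (seconds)."""
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request that is refused stays on the client until a later call succeeds, so the server does not have to queue it or hold resources for it.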
Purpose: To create an end-to-end Storage Quality of Service (QoS) based upon Linux systems.
- Configure Client based SLA - Service Level Agreement describing attributes of the storage provided. Attributes can include: Maximum IOPS (I/Os per second), maximum throughput, minimum IOPS, minimum throughput, maximum latency (response time).
- Monitor Compliance (throughput, latency, IOPS, etc.)
- Send adjusted throughput limits to all connected clients to meet SLA (throughput/IOP throttle). [RPC?]
- Accept throttle settings from two or more Storage Servers
- Enforce limits on per server basis
Suggested implementation environment: Linux NFS client/server. Should have at least one server and at least three clients.
Volume (filesystem) based SLA, balancing throughput among all clients accessing the volume using offered load for determining the individual client settings (requires feedback from the client to the server).
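One simple way to derive per-client settings from offered load is a proportional split, sketched below in Python. The function name and the choice of a proportional rule are assumptions; other policies (for example max-min fairness) would fit the same description.

```python
def split_limit(volume_limit, offered_load):
    """Divide a volume's throughput limit among clients by offered load.

    `offered_load` maps a client name to the throughput it is requesting,
    as reported back to the server by that client.
    """
    total = sum(offered_load.values())
    if total <= volume_limit:
        # No contention: every client may run at its offered load.
        return dict(offered_load)
    # Contention: scale every client down by the same factor.
    return {client: volume_limit * load / total
            for client, load in offered_load.items()}
```

For a volume limit of 100 and offered loads of 150 and 50, the clients would be limited to 75 and 25 respectively.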
Enhance control to encompass different request types e.g. Read, Write, Other.
Labeling requests with identifier and completion deadline. Allows for mixed workloads from a single client and enables latency targets. Likely requires extension to NFS protocol.
- Undergraduate Research III - Dr. Frank Mueller/Amir Bahmani
Design and Implementation of a Scalable, Adaptable Multi-stage Pipeline for Medical Applications
There are two major technologies that have been used by researchers in the field of High Performance Computing (HPC). First, Message Passing Interface (MPI), a library for distributed memory parallel programming, which is a language-independent communications protocol; second, Apache Spark, a fast and general engine for big data processing with built-in modules for streaming, SQL, machine learning and graph processing.
In the domain of distributed systems, one of the main challenges developers face is to choose an appropriate technology for their applications. Data and operation are the key entities in selecting an appropriate technology. Decisions would be different based on whether the application sends data or operations to computing nodes. Generally speaking, MPI is suitable when the size of messages is small, and there is no need to shuffle significant amounts of data between computing nodes. On the other hand, the map-reduce paradigm that Apache Spark supports is useful when many shuffle operations are issued between processes. Other factors, such as fault tolerance, also impact developers' decisions in choosing an appropriate technology.
Considering the abovementioned factors in advance without running the application is laborious, and it requires sophisticated modeling techniques (e.g., modeling the communication cost of underlying network, etc.). One more cost-effective approach would be to experimentally test the applications under different input configurations using different technologies.
This project focuses on three existing medical applications provided by our collaborators at Duke and UNC shown in Figure 1: (1) DNA Sequencing, (2) a knee cartilage MRI processing, and (3) a brain MRI processing. As illustrated below, these applications feature a sequence of stages, currently run serially via a shell script.
Figure 1: Processing stages making up the three target application areas for this project.
Pipelining these applications offers a flexible way of improving performance across multiple inputs. We can run multiple stages in parallel, with data from each input progressing through the stages as the sequence of inputs is processed. We've completed pipelined implementations of two of these applications using MPI, and we expect the third to be completed soon.
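The pipelining idea can be illustrated with a small Python sketch in which each stage runs in its own thread and passes results to the next stage through a queue. This uses threads rather than MPI or Spark, omits error handling, and all names are hypothetical; it only shows how a later input can enter the first stage while an earlier input is still in a later stage.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input sequence


def _run_stage(func, inbox, outbox):
    """Apply one stage to every item, then pass the end marker downstream."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(stages, inputs):
    """Run each stage in its own thread; results keep the input order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in inputs:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

For instance, `run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])` returns `[4, 6, 8]`. In the real applications each stage would wrap one step of the existing shell script.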
Our long-term objective is not just to create a single pipelined implementation for these three applications. We want to develop generic, expandable technologies for creating pipelines suitable for a variety of medical big data applications running on a variety of hardware and software. This requires the exploration of the space of alternative parallel implementations and development of mechanisms for tuning the behavior.
For the senior design project, our first objective is to develop competing Spark implementations of the three pipelines depicted in Figure 1. These new parallel implementations will be based on the existing serial implementations created by domain experts and the parallelized implementations created with MPI. The Spark implementations will be a first step in exploring the space of possible, high-performance implementation alternatives.
The research phase of the project will focus on comparing the performance of MPI versions with the Spark ones under a variety of input configurations. The outcome of this research subtask would be to identify the most important parameters affecting performance and the overall decision in choosing an appropriate technology. For example, if the number of testing images is less than twice the number of computing nodes, MPI is a better option because the amount of communication overhead is less; therefore, for this simple example, the key parameters are number and size of images.
After identifying key parameters, the fourth phase will focus on writing a Python/Scala script that receives input parameters for these pipelines [e.g., pipeline stages (input, output), size and format of input files, etc.], and recommends which technology is most suitable for each phase of the pipeline.
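A first version of such a script might encode each finding as a rule. The Python sketch below applies only the single example rule given above (fewer inputs than twice the number of nodes favors MPI); the function name, the stage dictionary layout, and the choice of Spark as the fallback are assumptions, and the real rules are to be discovered in the research phase.

```python
def recommend(stages, num_nodes):
    """Return {stage name: technology} using the one example rule in the text.

    Each stage is a dict with hypothetical keys "name" and "num_inputs".
    """
    choices = {}
    for stage in stages:
        if stage["num_inputs"] < 2 * num_nodes:
            choices[stage["name"]] = "MPI"
        else:
            # Assumption: Spark is the fallback when the rule does not favor MPI.
            choices[stage["name"]] = "Spark"
    return choices
```

With eight computing nodes, a stage with 10 inputs would be assigned MPI and a stage with 100 inputs would be assigned Spark under this rule.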
Although not absolutely essential, familiarity with the Python programming language and an interest in medical applications and high-performance computing would help team members to contribute to this project. This team will be working with researchers at NC State, UNC and Duke.
- Dr. Frank Mueller, Professor of Department of Computer Science at North Carolina State University
- Dr. Marc Niethammer, Associate Professor of Department of Computer Science, UNC Chapel Hill
- Dr. Kouros Owzar, Professor of Biostatistics and Bioinformatics and Director of Bioinformatics, Duke Cancer Institute
- Dr. Martin Styner, Associate Professor of Psychiatry and Computer Science, UNC and Director of the Neuro Image Research and Analysis Laboratory (NIRAL)
- Amir Bahmani, CSC Ph.D. Candidate, NC State
We expect the development effort for this project will be organized as follows:
- Spark Implementation of DNA Data Preprocessing Pipeline (MPI + Python; Spark + Scala)
- Spark Implementation of Imaging Analysis for MRI Knee Cartilage Segmentation Pipeline (MPI + Python; Spark + Scala)
- Spark Implementation of Imaging Analysis for MRI Brain Segmentation Pipeline (MPI + Python; Spark + Scala)
- DNA Data Processing (MPI vs. Spark)
- DNA Data Processing (first half of the third month)
- Writing scripts supporting self-tuning for the three applications based on the extracted parameters (second half of the third month)
* Amir will help students to learn Spark/MPI/Scala